This document is the final report of work performed under the project entitled
"Design and Evaluation of Fault-Tolerant VLSI/WSI Processor Arrays," supported by
the Innovative Science and Technology Office of the Strategic Defense Initiative
Organization and administered through the Office of Naval Research under Contract
No. 00014-85-k-0588. With the concurrence of Dr. Clifford Lau, the Scientific
Officer for this project, this final report consists of reprints of publications
reporting work performed under the project. In the attached list of publications,
items 1, 2, 3 and 7 are papers where fault-tolerant systems for processor arrays
are proposed and studied. Studies on algorithmic and software aspects relevant to
such systems are reported in items 4, 5, 8, 12 and 13. Research on hardware and
reconfigurability issues for fault-tolerant processor arrays is reported in
items 8, 9, 10 and 11.
Gracefully Degradable Processor Arrays

Abstract: A new approach to the design of gracefully degradable processor
arrays is discussed. Fault tolerance and graceful degradation are achieved by
simultaneously reconfiguring the processor array and the algorithm in execution.
Two types of algorithm reconfigurability are considered, namely, row
reconfigurability (RR) and row-column reconfigurability (RCR). Correspondingly,
two array reconfiguration schemes are discussed, i.e., successive row
elimination (SRE) and alternate row-column elimination (ARCE). It is shown that
the computations of any algorithm executable in a processor array can always be
(re)organized so that the resultant algorithm has the RR and/or RCR properties.
Upper bounds on the increase in execution time of an algorithm due to
reorganization of computations for reconfigurability are derived. Detailed
analysis of performance and reliability is done for both SRE and ARCE
reconfiguration schemes. These reconfiguration techniques are applicable to any
processor array and suitable for VLSI technology.

Index Terms: Algorithm transformations, computational availability, dynamic
reconfiguration, graceful degradation, performability, processor arrays,
reliability.

I. Introduction

A recurrent theme in the quest for efficient high-speed computing systems is the
need for matching the structure of algorithms and the configuration of parallel
computers. In these systems, successful fault tolerance and graceful degradation
schemes must disturb minimally the conformability of algorithm and architecture.
These brief ideas underlie this paper's approach to the design of processor
arrays where graceful degradation is achieved by simultaneous reconfiguration of
algorithm and architecture.
The pervasive consideration in the efficient use of processor arrays [1]-[10] is
the careful development of algorithms that allow the allocation or pipelining of
data and instructions so that the right data can be made available to the right
processor at the right time by using the limited interconnection capabilities of
the array. In [1]-[10] the reader can find a sample of problems, solutions, and
experience on mapping algorithms into processor arrays. In these works, the
prevalent realization is that dynamic reallocation of data and instructions is a
difficult and time-consuming task. A duality argument can be used to claim that
dynamic reconfiguration of an array with fully rearrangeable interconnections can
be just as hard. In fact, at a more abstract level, both problems reduce to
dynamically achieving "isomorphism" between algorithm and architecture. Any
component failure in a processor array (without fault tolerance) is a potentially
disrupting event to this "isomorphism" and may result in severe, if not total,
performance loss. Because the large size of processor arrays and their tasks
imply a high probability of failure, this may become an important limiting factor
to the use of such computational machines.

Redundancy can be used to add fault tolerance to processor arrays, i.e., spare
components are added to the system and they can replace faulty units, thus
preserving the original computing structure and algorithm mapping. An alternative
and equivalent way of thinking about redundancy solutions consists of programming
algorithms which are smaller than those requiring the use of the full array and
sparing out extra unused processors [30]. A distinct but yet related benefit of
redundancy is the possibility of improving VLSI array fabrication yields
[14]-[20], and several redundancy techniques used for this purpose are
potentially applicable to fault-tolerant array computation. For a critique and
appraisal of some of these schemes the reader is referred to [14]. The amount of
redundancy used in a system is limited by economical and technological
constraints (e.g., in [16] it was observed that yield improvement saturates above
10 percent of redundancy), and the minimization of redundancy for a given fault
tolerance level is an important research problem [11]. Limited redundancy has
been proposed or used for the MPP [3], Illiac IV [12], CHiP [6], and Diogenes
arrays [14], among others. The main observation is that, by definition,
redundancy solutions still require a fully operational replica of the original
array, i.e., no degradation is possible. Alternative approaches to fault
tolerance include error-correction techniques [21] and algorithm rescheduling
strategies [22]. The former explores mathematical properties of the algorithm and
is specialized in nature. The latter is algorithm dependent and does not explore
the possibility of limited array reconfigurability. An algorithm independent
approach to fault tolerance oriented towards preserving the connectivity of VLSI
multiprocessor systems has also been reported [13].

In this paper we present a novel approach to fault tolerance and graceful
degradation in array processors. The main idea is discussed informally in
Section II. It consists of using algorithm mapping strategies and simple hardware
mechanisms which make it possible to preserve conformability of algorithm and
architecture despite the removal of faulty processors. It will become clear that
our approach can be used together with previously proposed redundancy solutions,
and such "hybrid" schemes (briefly discussed in the last section) would have the
advantages of both approaches. Section III describes in a formal setting the
theory behind the two main algorithm reconfiguration strategies. Array
reconfiguration schemes are discussed in Section IV and their performance is
studied in Section V. Reliability, performability, and computational availability
studies of our techniques are presented in Section VI. Section VII is dedicated
to conclusions.

II.  Basic  Ideas 

We consider two different approaches for achieving limited dynamic
reconfigurability of processor arrays and algorithms. The first approach to array
reconfiguration consists of logically removing rows containing one or more faulty
processors, and is referred to as successive row elimination (SRE). The second
approach consists of removing either rows or columns with faulty processors
(starting with rows), and is referred to as alternate row-column elimination
(ARCE). Both schemes require the addition of programmable switches and
interconnections to the original array architecture and assume that "peripheral"
processors are cyclically connected via "wrap-around" links (like in Illiac IV
and MPP) or external memory (which is always possible). In correspondence with
the two possible array reconfiguration schemes, we consider two algorithm
reconfiguration strategies, namely, row reconfigurability (RR) and row-column
reconfigurability (RCR). Interchanging the words "row" and "column" in the
definitions of SRE, ARCE, RR, and RCR yields the dual reconfiguration schemes
SCE, ACRE, CR, and CRR. Due to this duality, we will not discuss them. The next
paragraph introduces informally the basic ideas on algorithm reconfigurability.

Consider an algorithm with $(T_0 \times n_1 \times n_2)$ computations which is
executed in time $T_0$ in an array with $(n_1 \times n_2)$ processors. To each
computation associate the time of execution and the coordinates of the processor
where it is executed, i.e., index each computation with an integer vector
$j = [j_1\ j_2\ j_3]^t$, $1 \le j_1 \le T_0$, $1 \le j_2 \le n_1$,
$1 \le j_3 \le n_2$. The resulting index set of the algorithm is geometrically
represented in Fig. 1(a). If, during execution, data move in a direction for
which the value of $j_2$ does not decrease, then we say that the algorithm has
the RR property; if the values of $j_2$ and $j_3$ do not decrease, then the
algorithm satisfies the RCR property.
Assume that our algorithm has the RR property and due to a fault we remove the
last row of the original $(n_1 \times n_2)$ array. Then, we must also reconfigure
the algorithm for execution in an $((n_1 - 1) \times n_2)$ array. This can be
done by partitioning the algorithm into two "subalgorithms" or "bands" separated
by the plane $j_2 = n_1 - 1$ [Fig. 1(b)]. First, the reduced array executes the
subalgorithm for which $1 \le j_2 \le n_1 - 1$ and the RR property ensures that
no computations require data from the other band. Next, the second subalgorithm
is executed, possibly using data generated in the previous band and recycled
through wrap-around or external memory connections. Note that potentially slow
external memory communication can be done concurrently with the
execution of a band. In fact, data generated in some order by a band will be used
by the next band in the same order (i.e., FIFO stacks are suitable memory
structures for this purpose). Similarly, if the algorithm has the RCR property
and an $((n_1 - 1) \times (n_2 - 1))$ array is used, then the algorithm can be
partitioned by the planes $j_2 = n_1 - 1$ and $j_3 = n_2 - 1$ into four bands
which can be executed in increasing lexicographical order without violating data
dependences [Fig. 1(c)]. It is important to compare this situation with the case
when the array has $((n_1 - 2) \times n_2)$ processors for which the algorithm
would still be partitioned in only two bands [Fig. 1(d)]. The remaining
considerations on data communication are similar for RR and RCR with the
exception that RCR may require additional wrap-around connections or external
memory.

Fig. 1. Partitioned algorithm on processor arrays of four sizes.
(a) $(n_1 \times n_2)$ (partitioning not required). (b) $((n_1 - 1) \times n_2)$.
(c) $((n_1 - 1) \times (n_2 - 1))$. (d) $((n_1 - 2) \times n_2)$.

The ideas underlying our approach to algorithm reconfigurability are also useful
for the problem of partitioning an algorithm for execution in fixed-size VLSI
array architectures. From the discussion above, it is clear that RCR is a
sufficient condition for algorithm partitionability ([9], [10], [23]). Similarly,
one can also think of RR as a sufficient condition for the partitionability of an
algorithm along a single direction.

Not all algorithms executed in processor arrays have the RR or RCR property.
However, in the next section we show that for any such algorithm we can always
find an equivalent algorithm which satisfies such properties.

III. Algorithm Reconfiguration Schemes (RR and RCR)

We see a processor array as a two-dimensional grid in which each integer point is
a vector index of a processor and a set of vectors (the interconnection
primitives) which describes the (regular) pattern of interconnections of the
array.
Definition 3.1: A processor array is a tuple $(L^2, P)$ where $L^2 \subset Z^2$
is the index set of the array and $P \in Z^{2 \times r}$ is a matrix of
$r \in I$ interconnection primitives. Here $Z$ and $I$ denote the sets of
integers and nonnegative integers, and $Z^n$ and $I^n$ denote their respective
$n$th Cartesian powers.

Thus, in a processor array $(L^2, P)$, the processor with index $l \in L^2$ is
connected to a processor with index $l' = l + p$, $p \in P$, if $l' \in L^2$, and
is connected to an input-output port otherwise. This definition does not account
for "wrap-around" external connections, which, however, are assumed to exist
between input and output ports.
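
The connection rule of Definition 3.1 can be illustrated with a minimal sketch. It assumes a square orthogonal array whose interconnection primitives are the four nearest-neighbor unit vectors; the names `N`, `PRIMITIVES`, `in_index_set`, and `connections` are hypothetical and are not taken from the paper.

```python
# Illustrative sketch of Definition 3.1 for an N x N orthogonal array.
# All names here are hypothetical; they do not come from the paper.
N = 4  # indices in each dimension run from 0 to N - 1

# Columns of P: the four nearest-neighbor interconnection primitives (assumed).
PRIMITIVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def in_index_set(l):
    """Return True if l = (l1, l2) belongs to the index set L^2."""
    return 0 <= l[0] <= N - 1 and 0 <= l[1] <= N - 1


def connections(l):
    """Map each primitive p to l + p, or to "I/O port" when l + p is outside L^2."""
    result = {}
    for p in PRIMITIVES:
        target = (l[0] + p[0], l[1] + p[1])
        result[p] = target if in_index_set(target) else "I/O port"
    return result


corner = connections((0, 0))    # two neighbors and two I/O ports
interior = connections((1, 2))  # four neighbors
```

A corner processor reaches two neighbors and two input-output ports, while an interior processor reaches four neighbors. Wrap-around links between ports are deliberately left out, as in the definition.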

Example 3.1: The structure of orthogonal arrays like the Illiac IV, MPP, WAP, and
others can be described by $(L^2, P)$ where

$$L^2 = \{[l_1\ l_2]^t : 0 \le l_1, l_2 \le N - 1\}, \qquad
P = \begin{bmatrix} 0 & 1 & 0 & -1 \\ 1 & 0 & -1 & 0 \end{bmatrix}$$

where $N = 8$ for the Illiac IV, $N = 128$ for the MPP, and $N$ is variable for
VLSI arrays. Fig. 2 shows a (4 × 4) square orthogonal array. (End of example.)

The execution of an algorithm on a given array can be thought of as an ordered
set of instantiations of the array, each of which contains an assignment of
computations to processors at a particular time of execution. Consequently, we
see an array algorithm as a three-dimensional grid in which each integer point
$[j_1\ j_2\ j_3]^t$ indexes a computation at time $j_1$ and processor
$[j_2\ j_3]^t$, and a set of vectors (dependence vectors) which is related to the
pattern of generation and use of data in time and space. In other words, if a
computation with index $j$ generates a value used in computation with index $j'$,
then $j' - j$ is a dependence vector. Clearly, the first entry of any dependence
vector must be $\ge 1$ (i.e., at least one unit of time separates generation and
use of a variable), and the vector corresponding to the other two entries must
correspond to a linear combination of interconnection primitives (i.e., a path
connecting the processors where the variable is generated and used). Assuming
that communication (over a single interconnection primitive) and execution of a
computation take one unit of time, the number of interconnection primitives used
to communicate a result from computation with index $j$ to computation with index
$j'$ must also be less than or equal to the first entry of the dependence vector
$j' - j$ (i.e., the interval of time between the computations). These
considerations translate into the following definition of array algorithm.

Definition 3.2: Consider an array $(L^2, P)$, $P \in Z^{2 \times r}$. An array
algorithm is a tuple $(J^3, D)$ where $J^3 \subset Z^3$ is the index set of the
algorithm, and $D \in Z^{3 \times m}$ is a matrix of $m \in I$ dependence vectors
such that

$$d_{1i} \ge 1, \quad i = 1, \ldots, m \tag{1}$$

$$\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} D = PK
\quad \text{for } K \in I^{r \times m} \text{ such that }
\sum_{j=1}^{r} k_{ji} \le d_{1i}, \quad i = 1, \ldots, m. \tag{2}$$

We follow the usual assumption that in one unit of time, a processor can read the
output registers of neighboring processors, process data if necessary, and write
results into its own output registers.

Fig. 2. A (4 × 4) square orthogonal array.

In this definition of array algorithm we represent only the structure of the
algorithm and abstract from the actual computations being performed. This is
adequate because we are essentially concerned with problems of matching
computational structures. Also, input and output data are not explicitly
represented because they can be treated as generated data (i.e., for a given
processor receiving data from other processors there is no distinction between
data generated and data "passed" by those processors). Finally, the description
of dependences would be more precise if to a given dependence vector we associate
the index point where the dependence is valid. This complication turns out to be
unnecessary for the derivation of our main results.

Example 3.2: In [reference number illegible in the source] the following
computation was performed on the MPP as a filtering procedure required to avoid
nonlinear instability in the solution of Navier-Stokes equations

[equation illegible in the source], $1 \le i \le N$, $1 \le j \le N$,

where $N = 128$ = number of processors along one dimension of the MPP. The
structure of the MPP is described in Example 3.1. This computation corresponds to
an MPP array algorithm because its dependence matrix $D$ [entries illegible in
the source] satisfies (1) and (2). In other words, this algorithm maps trivially
into the MPP because the number of computations matches the number of processors
and the next-neighbor communication can be performed in one unit of time. (End of
example.)
Example 3.3: In [7] it was shown that the following algorithm, which describes a
simplified version of a standard relaxation computation, is not amenable to
parallel execution (unless transformed as described later in Example 3.4):

$$u^{i}_{j,k} = \tfrac{1}{4}\left(u^{i-1}_{j+1,k} + u^{i-1}_{j,k+1}
+ u^{i}_{j-1,k} + u^{i}_{j,k-1}\right),
\quad 1 \le i \le L,\ 1 \le j \le M,\ 1 \le k \le N.$$

Here we note that, because the first entries of the last two dependence vectors
in

$$D = \begin{bmatrix} 1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 1 \end{bmatrix}
\begin{matrix} i \\ j \\ k \end{matrix}$$

are smaller than 1, this algorithm is not an array algorithm.
(End of example.)
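
As a minimal sketch, the two conditions of the array-algorithm definition can be checked mechanically for an orthogonal array. The sketch assumes unit nearest-neighbor primitives, so that the smallest number of primitives realizing a displacement $(d_2, d_3)$ is $|d_2| + |d_3|$; the function name and the second test matrix are hypothetical illustrations.

```python
def is_orthogonal_array_algorithm(dependences):
    """Check conditions (1) and (2) for nearest-neighbor orthogonal arrays.

    Each dependence is a tuple (d1, d2, d3): time, row, and column offsets.
    """
    for d1, d2, d3 in dependences:
        if d1 < 1:
            return False  # (1): at least one time unit between generation and use
        if abs(d2) + abs(d3) > d1:
            return False  # (2): not enough time units to traverse the path
    return True


# Dependence vectors read column by column from D in Example 3.3.
example_3_3 = [(1, -1, 0), (1, 0, -1), (0, 1, 0), (0, 0, 1)]

# A hypothetical variant in which every dependence spans one time unit.
one_step_variant = [(1, -1, 0), (1, 0, -1), (1, 1, 0), (1, 0, 1)]

fails = is_orthogonal_array_algorithm(example_3_3)
passes = is_orthogonal_array_algorithm(one_step_variant)
```

The check rejects the relaxation algorithm because two of its dependence vectors have a zero time entry, which is the reason given in the example.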

As illustrated by the last example, not all algorithms are array algorithms. In
other words, they must be transformed, or equivalently, their computations must
be reorganized so that an equivalent array algorithm is obtained. The
reorganization of an algorithm can be seen as a permutation of its index set, and
hence, it can be described as a linear transformation $T \in Z^{3 \times 3}$ such
that $T$ is nonsingular and $T = \begin{bmatrix} \pi \\ S \end{bmatrix}$ where
$\pi \in Z^{1 \times 3}$ is referred to as a time transformation and
$S \in Z^{2 \times 3}$ is called a space transformation. In other words, $T$
reorganizes the computations so that a computation with index $j$ is executed at
time $\pi j$ and processor $Sj$. Due to the linearity of this type of
transformation, the dependence matrix of a transformed algorithm is simply $TD$
where $D$ is the dependence matrix of the original algorithm. Hence, $T$ can be
selected so that the new dependence matrix makes the transformed algorithm an
array algorithm. The fact that (1) must hold for the new matrix ensures that data
dependences are not violated (i.e., $T$ yields an array algorithm which is
equivalent to the original one). This type of transformation was introduced in
[27] where $T$ is denoted $R$ and referred to as reindexing transformation.
Subsequent work in reindexing transformations is reported in [9], [10], [28],
[29] and their references. In this paper, we add to the knowledge of algorithm
transformations by showing that there always exists some $T$ which yields an
algorithm with RR and RCR properties (Theorem 3.1) and how to derive upper bounds
on the execution time of such algorithms (Theorem 3.2). Next, we illustrate how a
reindexing transformation can be used to transform an algorithm into an array
algorithm.

Example 3.4: Assume $L = M = N = 4$ in the algorithm of Example 3.3, and consider
the array shown in Fig. 2. In [7], the transformation

$$T' = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

was used to obtain an equivalent algorithm suitable for parallel computation. The
slightly different transformation

$$T = \begin{bmatrix} \pi \\ S_1 \\ S_2 \end{bmatrix}
= \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

transforms that algorithm into an array algorithm because the resulting
dependence matrix is

$$TD = \begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 1 \end{bmatrix}$$

for which the first row has only positive entries and

$$\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} TD
= P \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$

with $P$ as in Example 3.1. Fig. 3 shows three steps of the execution of this new
algorithm. Empty squares represent unused processors. The index of the variable
generated is shown for each busy processor. Arrows indicate data communication
required for the computations, and broken lines identify computational
wavefronts. The total execution time is
$\pi [L - 1\ \ M - 1\ \ N - 1]^t + 1 = 13$ units of time. (End of example.)
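
The arithmetic of this example can be reproduced with a short sketch in plain Python. The helper `matmul` and the variable names are illustrative only; the matrices are those of Examples 3.3 and 3.4, with $T$ formed by the time row $\pi = [2\ 1\ 1]$ and the space rows $[0\ 1\ 0]$ and $[0\ 0\ 1]$.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Dependence matrix of the relaxation algorithm (one column per dependence).
D = [[1, 1, 0, 0],
     [-1, 0, 1, 0],
     [0, -1, 0, 1]]

# Reindexing transformation: first row is pi (time), last two rows are S (space).
T = [[2, 1, 1],
     [0, 1, 0],
     [0, 0, 1]]

TD = matmul(T, D)

# Every transformed dependence must advance time by at least one unit.
time_row_positive = all(entry >= 1 for entry in TD[0])

# Execution time: pi applied to the extent of the index set, plus one.
L = M = N = 4
pi = T[0]
execution_time = sum(p * extent for p, extent in zip(pi, (L - 1, M - 1, N - 1))) + 1
```

Here `TD` comes out with an all-ones first row and `execution_time` evaluates to 13, in agreement with the example.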

The problems of selecting $T$ for general case algorithms are out of the scope of
this paper, and the interested reader is referred to [9]. Here, we concentrate on
the problem of selecting $T$ for array algorithms so that the transformed
algorithm satisfies the reconfigurability properties, i.e., the RR and RCR
property. These properties can now be defined very simply in terms of the
dependence matrix of an array algorithm.

Definition 3.3: An array algorithm $(J^3, D)$ has the

RR property if the entries of the second row of $D$ are nonnegative;

RCR property if the entries of the last two rows of $D$ are nonnegative.

Example 3.5: The algorithm of Example 3.2 has the RR property, but does not have
the RCR property. (End of example.)

Example 3.6: The algorithm of Example 3.4 does not have the RR and RCR
properties. However, if $T$ is redefined as

$$T = \begin{bmatrix} \pi \\ S_1 \\ S_2 \end{bmatrix}
= \begin{bmatrix} 3 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$

the resultant algorithm has the RR property. In fact,

$$TD = \begin{bmatrix} 2 & 2 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & -1 & 0 & 1 \end{bmatrix},$$

and it is interesting to note that communication associated with the first two
dependences takes two units of time, and for the second one it requires the use
of two interconnection primitives, namely, $[1\ 0]^t$ and $[0\ {-1}]^t$. Three
steps of the execution of this algorithm are shown in Fig. 4 where arrows
indicating data communication which takes two units of time are labeled with a
(2). The execution time is now $\pi [L - 1\ \ M - 1\ \ N - 1]^t + 1 = 16$ units
of time. (End of example.)

Fig. 3. Three steps of the execution of the algorithm of Example 3.4.

Fig. 4. Three steps of the execution of the algorithm of Example 3.6.
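
The RR and RCR tests amount to sign checks on rows of the dependence matrix, as the following sketch shows. The function names are hypothetical; the two matrices are the transformed dependence matrices obtained with the time rows $[2\ 1\ 1]$ (Example 3.4) and $[3\ 1\ 1]$ (Example 3.6).

```python
def has_rr(D):
    """RR property: every entry of the second row of D is nonnegative."""
    return all(entry >= 0 for entry in D[1])


def has_rcr(D):
    """RCR property: every entry of the last two rows of D is nonnegative."""
    return all(entry >= 0 for row in D[1:3] for entry in row)


# Transformed dependence matrix of Example 3.4.
TD_example_3_4 = [[1, 1, 1, 1],
                  [-1, 0, 1, 0],
                  [0, -1, 0, 1]]

# Transformed dependence matrix of Example 3.6.
TD_example_3_6 = [[2, 2, 1, 1],
                  [1, 1, 0, 0],
                  [0, -1, 0, 1]]

# Execution time for Example 3.6 with L = M = N = 4 and pi = [3 1 1].
execution_time = 3 * 3 + 1 * 3 + 1 * 3 + 1
```

The first matrix fails both tests, the second passes the RR test only, and the execution time grows from 13 to 16 units, which is the cost of reorganizing for row reconfigurability in this case.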

Next we prove that we can always reorganize the computations of any array
algorithm so that the resultant algorithm has the RR and the RCR properties.

Theorem 3.1: Given an array algorithm $A$ it is always possible to reorganize the
computations of $A$ so that the resultant array algorithm has the RR and RCR
properties.

Proof: We show that we can always find $T$ such that the new dependence matrix
has no negative entries. By definition of array algorithm, the dependence matrix
$D$ of $A$ must have only positive entries in the first row. This means that
there exists a convex set which contains all dependence vectors of $A$. Hence,
using convex set theory, there exists an infinite number of separating
hyperplanes. In other words, there exists an infinite number of vectors
$s \in Z^{1 \times 3}$ such that $sd \ge 0$ for all $d \in D$. Furthermore,
because the subspace outside the convex set is not degenerate, from that set of
vectors one can always choose three linearly independent vectors for rows of $T$
so that $T$ is nonsingular. It is also clear that they can always be chosen so
that the new dependence matrix satisfies (2) for some $K$. Hence, the new
algorithm is still an array algorithm, and its dependence matrix has no negative
entries, thus implying that the RR and RCR properties hold. Q.E.D.

The next theorem provides an upper bound on the execution time of the
reconfigured algorithm as a function of the execution time of the original
algorithm. This upper bound is valid for arrays with a matrix of interconnection
primitives identical to that of Example 3.1. In other words, we consider only the
class of square orthogonal arrays. The methodology used can be easily applied to
the derivation of similar bounds for other classes of arrays.

Theorem 3.2: For any $(n \times n)$ orthogonal array algorithm with execution
time $T_0$ there exist equivalent orthogonal array algorithms with the RR and the
RCR properties and with execution time $T^* < 2T_0^2/n + 3T_0 + n$ and
$T^* < 3T_0^3/n^2 + 9T_0^2/n + 6T_0$, respectively.

Proof: Assume that the original orthogonal array algorithm does not have the RR
property and consider the worst case possible, i.e., the case when at every
instant of time all processors and interconnection primitives are used. This
corresponds to an algorithm for which

$$D = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & -1 & 0 \\ 0 & 0 & 1 & 0 & -1 \end{bmatrix},$$

and it can be transformed into an equivalent algorithm with the RR property by
choosing $T$ as

$$T = \begin{bmatrix} \pi \\ S_1 \\ S_2 \end{bmatrix}
= \begin{bmatrix} 2 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{so that} \quad
TD = \begin{bmatrix} 2 & 3 & 2 & 1 & 2 \\ 1 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & -1 \end{bmatrix}.$$

Given that the original algorithm has $(T_0 \times n \times n)$ computations, the
new algorithm requires

$$\pi [T_0\ n\ n]^t - \pi [1\ 1\ 1]^t + 1 = 2T_0 + n - 2 \tag{3}$$

units of time to execute in an array with $T_0$ rows and $n$ columns (due to the
values of $S_1$ and $S_2$). Given that the actual array has only $n$ rows, we
must partition the algorithm into $\lceil T_0/n \rceil < T_0/n + 1$ bands (i.e.,
now $S$ allocates computation with index $j$ to processor
$[(S_1 j \bmod n)\ \ S_2 j]^t$), and each band takes at most the time given by
(3). Hence, the total execution time of the new algorithm is

$$T^* < (2T_0 + n - 2)(T_0/n + 1) < 2T_0^2/n + 3T_0 + n. \tag{4}$$

Choosing $\pi = [3\ 0\ 0]$, $S_1 = [1\ 1\ 0]$, and $S_2 = [1\ 0\ 1]$ leads to a
similar proof for the RCR case.

Q.E.D.
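
The counting argument in the RR part of the proof can be replayed numerically. The following is a minimal sketch under the worst-case assumptions of the proof, with time row $[2\ 1\ 0]$ and space rows $[1\ 0\ 0]$ and $[0\ 0\ 1]$; the function names are hypothetical, and the band count uses the ceiling of $T_0/n$.

```python
def rr_worst_case(T0, n):
    """Return (band_time, bands, total, bound) for the RR construction."""
    pi = (2, 1, 0)
    # Time for one band: pi applied to [T0 n n], minus pi applied to [1 1 1], plus 1.
    band_time = sum(p * x for p, x in zip(pi, (T0, n, n))) - sum(pi) + 1
    bands = -(-T0 // n)                  # ceiling of T0 / n
    total = band_time * bands            # every band takes at most band_time
    bound = 2 * T0 ** 2 / n + 3 * T0 + n # RR bound stated in the theorem
    return band_time, bands, total, bound


def folded_processor(j, n):
    """Processor used by computation j = (j1, j2, j3) once rows are folded mod n."""
    s1_j = j[0]  # S1 = [1 0 0] selects the first index
    s2_j = j[2]  # S2 = [0 0 1] selects the third index
    return (s1_j % n, s2_j)


summary = rr_worst_case(13, 4)
everywhere_below_bound = all(
    rr_worst_case(T0, n)[2] < rr_worst_case(T0, n)[3]
    for T0 in range(1, 50) for n in range(1, 20)
)
```

For $T_0 = 13$ and $n = 4$ the sketch gives 28 time units per band and 4 bands, that is, 112 units in total against a bound of 127.5, which illustrates how much slack the closed-form bound carries.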

We remark that the upper bounds derived in the last theorem may be too high, and
how to find (for any array) a transformation $T$ which yields exact upper bounds
for $T^*$ is an open problem. Nevertheless, typically, algorithms executed in
$(n \times n)$ arrays have execution time linear in $n$, which implies that,
according to Theorem 3.2, reconfigured algorithms also have linear execution
time.

Example 3.7: The algorithm of Example 3.4 does not have the RR and RCR properties
and executes in 13 units of time, whereas the equivalent algorithm in Example 3.6
has the RR property and executes in 16 units of time. However, the value of the
upper bound given by Theorem 3.2 is $T^* < 2T_0^2/n + 3T_0 + n \approx 85$. This
discrepancy between the values of the upper bound and the actual execution time
is also due to the fact that the algorithm of Example 3.4 is not a worst case
algorithm like the one considered in the proof of Theorem 3.2. This is easily
realized from Fig. 3, which shows that not all processors and interconnection
primitives are used every time. (End of example.)

IV. Array Reconfiguration Schemes (SRE and ARCE)

This section describes the architectural features of processor arrays capable of
SRE and ARCE reconfiguration. In both cases, limited reconfigurability is
achieved by using redundant interconnections and switches. The figures used to
describe these architectures display the logic organization and functional
characteristics of the architectures and their components. Their physical layout
and implementation depend on the technology used and may take different forms, as
discussed in [14]-[20], [26].

A.  SRE  Reconfiguration 

The basic idea of SRE is as follows: if a fault occurs in processor $(i, j)$,
then eliminate logically the $i$th row of the array. The logical elimination of a
row is done by setting identical programmable switches to certain states and
using redundant interconnections to bypass the eliminated row. Fig. 5(a) shows
the additional interconnections and switches for a (4 × 4) array. The broken
lines represent the original hardware. In general, $(n_1 + 1)n_2$
interconnections (≈50 percent interconnection redundancy) and $(n_1 + 1)n_2$
switches are required. The structure and possible states of each switch are shown
in Fig. 5(c). We ignore the need for additional external memory due to further
partitioning of the algorithm because external memory is usually available or
inexpensive to add.

Fig. 5. Array architecture. (a) SRE reconfiguration. (b) ARCE reconfiguration.
(c) States and structure of switches used.

The reconfiguration of the array is done by setting the
switches to certain states. Let the $i$th row of switches be such
that it connects the $(i - 1)$th and $i$th rows of processors.
Let $X_i = 1$ when row $i$ of processors is eliminated from the
array because it contains at least one faulty processor, and
$X_i = 0$ otherwise (i.e., row $i$ is present). The $i$th row of
switches has its state determined by $X_i$ and $X_{i-1}$, i.e., each and
every switch in row $i$ is in state $X_{i-1}X_i$. It is easy to see that
the above rule can be implemented with simple logic distributed
across every row or every processor. Thus, additional
control hardware is minimal and can be ignored for practical
purposes. Furthermore, note that complete isolation of faulty
modules is provided by the switches and the control rule used.
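
The switch-setting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are numbered from 0, the state of a switch row is represented as the pair $(X_{i-1}, X_i)$, and the function names are hypothetical rather than taken from the paper.

```python
def eliminated_rows(n1, faulty):
    """Return the vector X: X[i] = 1 if row i holds at least one faulty processor."""
    x = [0] * n1
    for i, _j in faulty:          # each fault is a (row, column) pair
        x[i] = 1
    return x


def switch_row_states(x):
    """State of switch row i (between processor rows i-1 and i) is (X[i-1], X[i])."""
    return [(x[i - 1], x[i]) for i in range(1, len(x))]


x = eliminated_rows(4, [(2, 1)])   # one fault in row 2 of a 4-row array
print(x)                           # [0, 0, 1, 0]
print(switch_row_states(x))        # [(0, 0), (0, 1), (1, 0)]
```

Each switch row needs only the elimination flags of its two neighboring processor rows, which is why the rule can be evaluated locally.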

B.  ARCE  Reconfiguration

In describing ARCE reconfiguration, we assume without
loss of generality that in an $(n_1 \times n_2)$ array we have $n_1 \ge n_2$;
all the following discussions remain valid when $n_2 \ge n_1$ if we
interchange $n_1$ and $n_2$ and replace the word ARCE by ACRE.
ARCE removes either the row or the column of the array
containing a faulty processor according to the following rule:
remove a column if and only if $\lfloor n_1/n_2 \rfloor$ rows have been
eliminated after the last column elimination (if any). Note that for
$n_1 = 1$ or $n_2 = 1$, ARCE reduces to SCE and SRE, respectively.
As for SRE, logical elimination of rows or columns
uses additional switches and interconnections. Fig. 5(b)
shows this additional hardware (full lines) and the original
array (broken lines) with (4 × 4) processors. In general,
$(n_1 - 1)n_2 + n_1(n_2 - 1)$ additional interconnections
(≈100 percent redundancy) and $(n_1 - 1)n_2 + n_1(n_2 - 1)$
switches are required. As for SRE, additional external
memory and internal control logic can be ignored. The structure
and states of the switches are the same as for SRE
[Fig. 5(c)]. To reconfigure the array, the scheme used for
SRE reconfiguration is also used in ARCE for row elimination.
For column elimination the rule is similar except that
the word "row" is replaced by the word "column." Thus,
simple distributed logic can also be used, and total fault
isolation is also guaranteed.
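
The row/column alternation rule can be illustrated with a short Python sketch. It assumes $n_1 \ge n_2$, that the threshold in the rule is $\lfloor n_1/n_2 \rfloor$ (the bracket type is hard to read in the scanned text), and that every fault lands in a surviving row and column so that each fault forces a new elimination. The function name is hypothetical.

```python
def arce_worst_case(n1, n2, k):
    """Rows and columns removed after k faults, each forcing one elimination.

    Assumed rule: a column is removed exactly when floor(n1 / n2) rows have
    been removed since the last column removal; otherwise a row is removed.
    """
    rows_per_column = n1 // n2
    rows = cols = since_last_column = 0
    for _ in range(k):
        if since_last_column == rows_per_column:
            cols += 1
            since_last_column = 0
        else:
            rows += 1
            since_last_column += 1
    return rows, cols


print(arce_worst_case(4, 4, 3))   # (2, 1): row, column, row
print(arce_worst_case(8, 4, 5))   # (4, 1): row, row, column, row, row
```

For a square array this reduces to strict alternation, starting with a row.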

V.  Performance  Analysis 

A consequence of the simultaneous use of SRE with RR
and ARCE with RCR reconfiguration is graceful performance
degradation as the array size is reduced. In this section
we give a lower bound for array performance as a function of
the number of rows and columns eliminated. The lower
bound is exact in the sense that there are worst case algorithms
for which such lower bound performance results.
Assuming worst case algorithms for which the RR and RCR
properties hold, we also derive exact lower bounds for the
performance as a function of the number of faults for the best
and worst case fault distributions. The best case fault
distributions can also be thought of as the case when an ideal array
has the capability of reconfiguring itself so that faulty
processors can always be grouped in the same row or column.
For simplicity, we only present the reasoning leading to these
bounds, and the reader can use simple inductive rules to
verify their correctness.

Any algorithm executed in an $(n_1 \times n_2)$ array in time $T_0$
will perform at most $(T_0 \times n_1 \times n_2)$ computations. Thus,
the worst case happens when the algorithm has exactly
$(T_0 \times n_1 \times n_2)$ computations, i.e., all processors are always
busy. We consider this as the worst case in the sense that any
single processor failure results in the largest number of
computations that are not performed. Consider the case when this
worst case algorithm must be executed in an array of size
$(n_1 - m_1) \times (n_2 - m_2)$, $0 \le m_1 < n_1$, $0 \le m_2 < n_2$. This
smaller array will need at most time $T_0$ to compute each of the
$\lceil n_1/(n_1 - m_1) \rceil \lceil n_2/(n_2 - m_2) \rceil$ partitions of the original
algorithm, each of which has at most $T_0 \times (n_1 - m_1) \times (n_2 - m_2)$
computations. Taking computation time as a measure
of performance, the performance of any array with
$m_1$ rows and $m_2$ columns eliminated is

$$T \le \left\lceil \frac{n_1}{n_1 - m_1} \right\rceil \left\lceil \frac{n_2}{n_2 - m_2} \right\rceil T_0. \qquad (6)$$
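
As a quick numerical illustration of bound (6), the following Python sketch counts the partitions and scales the original time $T_0$. The function names are hypothetical, and integer ceiling division is used to avoid floating-point rounding.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def partitions(n1, n2, m1, m2):
    """Number of partitions needed after removing m1 rows and m2 columns."""
    if not (0 <= m1 < n1 and 0 <= m2 < n2):
        raise ValueError("need 0 <= m1 < n1 and 0 <= m2 < n2")
    return ceil_div(n1, n1 - m1) * ceil_div(n2, n2 - m2)


def time_bound(t0, n1, n2, m1, m2):
    """Upper bound (6) on the execution time in the reduced array."""
    return partitions(n1, n2, m1, m2) * t0


print(partitions(4, 4, 1, 0))       # 2: losing one of four rows already doubles the bound
print(time_bound(16, 4, 4, 2, 0))   # 32
```

Because of the ceiling, the bound is a step function of the number of rows and columns removed: the first elimination in either direction doubles it.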

Let $T_k$ denote the computation time for an array with $k$ faulty
processors. Let the ratio $T_0/T_k$ be a normalized measure of
performance when $k$ faults occur and let $P_k$ denote a lower
bound on such ratio (i.e., consider worst case algorithms).
Depending on the distribution of faults in the array, $P_k$ can
take distinct values for a fixed $k$. For SRE, $k$ faults cause the
elimination of at most $k$ and at least $\lceil k/n_2 \rceil$ rows. From (6) it
follows that

$$\left\lceil \frac{n_1}{n_1 - k} \right\rceil^{-1} \le (P_k)_{SRE} \le \left\lceil \frac{n_1}{n_1 - \lceil k/n_2 \rceil} \right\rceil^{-1}. \qquad (7)$$

For ARCE, in the worst case fault distribution,

$$k - \left\lfloor \frac{k}{\lfloor n_1/n_2 \rfloor + 1} \right\rfloor \ \text{rows and} \ \left\lfloor \frac{k}{\lfloor n_1/n_2 \rfloor + 1} \right\rfloor \ \text{columns}$$

are removed. In the best case fault distribution, the numbers
of rows and columns removed satisfy complex expressions,
and we prefer to use simpler conservative estimates. Clearly,
the number of rows removed is less than $\lceil k/n_2 \rceil$ and the
number of columns removed is less than [expression illegible in source].
Hence, from (6) we have

[equation (8), the corresponding bounds on $(P_k)_{ARCE}$, is illegible in source]

Note that for both SRE and ARCE the lower bound on
performance reduction (i.e., for the worst case algorithm and
worst case fault distribution) is always less than or equal to
0.5 for $k \ge 1$. Fig. 6 shows these bounds as functions of the
number of faults for the case when $n_1 = n_2 = n$.

Fig. 6. Worst case performance reduction. (a) SRE. (b) ARCE as a function
of the number of faults in an (n × n) array.
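
The SRE case can be tabulated with a small Python sketch. It is a sketch under stated assumptions: the bounds are obtained by substituting the largest ($k$) and smallest ($\lceil k/n_2 \rceil$) possible number of eliminated rows into (6) with no columns removed, and the function names are hypothetical.

```python
from fractions import Fraction


def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def sre_performance_bounds(n1, n2, k):
    """(worst, best) lower bounds on T0/Tk for k faults under SRE.

    worst: every fault hits a different row (k rows removed).
    best:  faults are packed into as few rows as possible (ceil(k/n2) rows).
    """
    if not 0 <= k < n1:
        raise ValueError("this sketch requires 0 <= k < n1")
    worst = Fraction(1, ceil_div(n1, n1 - k))
    best = Fraction(1, ceil_div(n1, n1 - ceil_div(k, n2)))
    return worst, best


for k in range(1, 7):
    print(k, sre_performance_bounds(10, 10, k))
```

For any $k \ge 1$ the worst case value is at most 1/2, in agreement with the remark above.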


Example 3.2: Assume that, due to the occurrence of
faults, the algorithm of Example 3.6 must be executed by
a) a (3 × 4) array, and b) a (2 × 4) array. Fig. 7 shows three
steps of the execution of the algorithm in case b). The total
execution time is $2(\pi(1, M - 1, N - 1) + 1) = 20$ units of
time where $\pi$, $M$, and $N$ are as in Example 3.6. The normalized
execution time or performance is 16/20 = 0.8, which is
higher than the value of 0.5 predicted by (7). The same can
be said for case a), and full details can be found in [31]. This
example also illustrates the execution of a reconfigured RR
algorithm in an SRE reconfigured array. In Fig. 7, the
wraparound arrows are labeled with (5) to indicate that
communication takes 5 units of time. This illustrates our claim that
this scheme allows the potentially time-consuming use of
external memory for recycling data. (End of example.)

VI.  Analysis  of  Reliability,  Performability,  and
Computational  Availability

In this section we evaluate SRE and ARCE reconfiguration
schemes in terms of the following commonly used measures:

1) reliability $R(t)$, i.e., the probability of no array failure
in the interval of time $[0, t]$;

2) performability $Perf(B, t)$ [24], i.e., the probability that
the array performs above some performance level $B$;

3) computational availability $A_c(t)$ [25], i.e., the expected
value of the computational capacity of the array at time $t$ (in
this work, computational capacity means the number of
processors that have not been removed from the array).

We use the common assumption that the processors of the
array have exponentially distributed failures, i.e., the
probability that a processor has not failed before time $t$, given that
it was initially functional, is $e^{-\lambda t}$ where $\lambda$ is the failure rate.
In our analysis the unit interval of time $[0, t]$ is such that
$\lambda t = 1$, i.e., we use a failure rate that is normalized with
respect to the time unit (e.g., $\lambda = 10^{-[\ldots]}$ failures/h,
$t = 10^{[\ldots]}$ h). The assumption of exponentially distributed failures
is also convenient because it will allow us to use Markov
models for arrays with SRE and ARCE reconfiguration
(exponential distributions are memoryless). This is justifiable
for arrays with independent processor modules. We also
consider the possibility of imperfect reconfiguration by using
a parameter $c$, the coverage, which is the conditional
probability of a successful array reconfiguration given that a failure
has occurred. This parameter incorporates the effectiveness
of both the fault-detection and switching mechanisms used
and we assume it to be the same for any fault in any processor.
Next, we give the number of possible array configurations
(also called degradation states, or simply states) for SRE and
ARCE schemes. This number also measures the number of
faults that such schemes sustain before total array failure
(assuming worst case fault distribution).

The number of processor failures tolerated in an $(n_1 \times n_2)$
array with SRE or SCE reconfiguration is, respectively,

$$(n_1 - 1) \le (N)_{SRE} \le (n_1 - 1)n_2 \qquad (9)$$

$$(n_2 - 1) \le (N)_{SCE} \le (n_2 - 1)n_1. \qquad (10)$$

The left- and right-hand sides of (9) correspond to the cases
when all faults occur in distinct rows and in the same rows,
respectively. This comment also applies to (10) if we replace
the word "rows" by the word "columns." Clearly, SRE is more
fault tolerant than SCE if and only if $n_1 > n_2$. This is a
criterion as to when to use SCE or SRE.
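
The selection criterion can be checked numerically with the following Python sketch of (9) and (10). The function names and the returned labels are hypothetical.

```python
def faults_tolerated_sre(n1, n2):
    """(minimum, maximum) number of faults tolerated with row elimination, per (9)."""
    return n1 - 1, (n1 - 1) * n2


def faults_tolerated_sce(n1, n2):
    """(minimum, maximum) number of faults tolerated with column elimination, per (10)."""
    return n2 - 1, (n2 - 1) * n1


def preferred_scheme(n1, n2):
    """Pick the scheme that tolerates more faults at both ends of the range."""
    if n1 > n2:
        return "SRE"
    if n2 > n1:
        return "SCE"
    return "either"


print(faults_tolerated_sre(6, 4), faults_tolerated_sce(6, 4))  # (5, 20) (3, 18)
print(preferred_scheme(6, 4))                                  # SRE
```

Both the guaranteed minimum and the maximum favor the same scheme, since $(n_1 - 1)n_2 > (n_2 - 1)n_1$ exactly when $n_1 > n_2$.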

For ARCE, the number of processor failures tolerated in an
$(n_1 \times n_2)$ array is


where the left- and right-hand side expressions correspond to
the case when all faults occur in distinct rows and columns
and to the case when they occur in the rows and columns
already eliminated, respectively.

Without loss of generality, we consider an (n × n) array
and analyze its reliability, performability, and computational
availability when SRE and ARCE reconfiguration is used.
The Markov state diagrams for these two cases are shown in
Fig. 8. The number of degradation states is given by the
left-hand side of (9) and (11) where $n_1 = n_2 = n$, i.e., $n - 1$
for SRE and $2n - 2$ for ARCE. In the same figures, every
state is represented by a circle showing the state number and
the number of processors in the reduced array for that state;
the state transition rates are also indicated. An arrow starting [...]

Fig. 8. Markov state diagram. (a) SRE. (b) ARCE reconfiguration.

[...] transition rate is [...], whereas for ARCE it is [...].
The Markov models for SRE and ARCE differ only in the
number of states, and they are described by the differential
equations

$$\frac{dPr_0(t)}{dt} = -(c\lambda_0 + (1 - c)\lambda_0)Pr_0(t) = -\lambda_0 Pr_0(t)$$

$$\frac{dPr_k(t)}{dt} = -\lambda_k Pr_k(t) + c\lambda_{k-1}Pr_{k-1}(t), \quad k = 1, \ldots, D \qquad (14)$$

where $D = n - 1$ for SRE and $D = 2n - 2$ for ARCE,
$Pr_k(t)$ denotes the probability of the array being in state $k$ at
time $t$, and the $\lambda$'s are as in (12) and (13) for SRE and ARCE,
respectively.
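
System (14) can be integrated numerically as a sanity check. The Python sketch below uses a classical fourth-order Runge-Kutta scheme. The transition rates are an assumption made for illustration, not a transcription of (12): state $k$ of an SRE-reconfigured (n × n) array is taken to have $n(n - k)$ working processors, each failing independently at rate $\lambda$, so that $\lambda_k = n(n - k)\lambda$. All function names are hypothetical.

```python
from math import exp


def sre_rates(n, lam=1.0):
    """Assumed rates: state k keeps n*(n-k) processors, each failing at rate lam."""
    return [n * (n - k) * lam for k in range(n)]   # states 0 .. D with D = n - 1


def derivatives(p, rates, c):
    """Right-hand side of (14) for the vector of state probabilities p."""
    dp = [-rates[0] * p[0]]
    for k in range(1, len(p)):
        dp.append(-rates[k] * p[k] + c * rates[k - 1] * p[k - 1])
    return dp


def integrate(rates, c, t_end, steps=2000):
    """Runge-Kutta (RK4) integration starting from Pr_0(0) = 1."""
    p = [1.0] + [0.0] * (len(rates) - 1)
    h = t_end / steps
    for _ in range(steps):
        k1 = derivatives(p, rates, c)
        k2 = derivatives([x + 0.5 * h * d for x, d in zip(p, k1)], rates, c)
        k3 = derivatives([x + 0.5 * h * d for x, d in zip(p, k2)], rates, c)
        k4 = derivatives([x + h * d for x, d in zip(p, k3)], rates, c)
        p = [x + h / 6.0 * (a + 2 * b + 2 * g + e)
             for x, a, b, g, e in zip(p, k1, k2, k3, k4)]
    return p


probs = integrate(sre_rates(5), c=1.0, t_end=0.1)
print(probs[0], exp(-25 * 0.1))   # state 0 should follow exp(-lambda_0 * t)
print(sum(probs))                 # probability that the array has not failed
```

With coverage $c < 1$ part of the probability mass leaks to the failed state at every transition, so the sum of the state probabilities decreases.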

Assuming the initial conditions $Pr_0(0) = 1$, $Pr_k(0) = 0$,
$k = 1, \ldots, D$, the solution to (14) can be obtained by using
Laplace transforms and partial fraction expansion as

$$Pr_0(t) = e^{-\lambda_0 t}$$

$$Pr_k(t) = c^k \left( \prod_{i=0}^{k-1} \lambda_i \right) \sum_{j=0}^{k} \frac{e^{-\lambda_j t}}{\prod_{i=0,\, i \ne j}^{k} (\lambda_i - \lambda_j)}, \quad k = 1, \ldots, D.$$

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-34, NO. 11, NOVEMBER 1985

After evaluating the state probabilities, the reliability,
performability, and computational availability can be computed
using the following expressions:

• reliability $= R(t) = \sum_{k=0}^{D} Pr_k(t)$
where the number of degradation states is $D = n - 1$ for
SRE and $D = 2n - 2$ for ARCE;

• performability $= Perf(B, t) = \sum_{k=0}^{j} Pr_k(t)$
where $j$ is such that the performance $P_k \ge B$ for $k = 0, \ldots, j$;

• computational availability $= A_c(t) = \sum_{k=0}^{D} Pr_k(t)C_k$
where $D$ is as above and $C_k$ is the number of processors in
state $k$, i.e.,

$$C_k = n(n - k) \quad \text{for SRE}$$

$$C_k = (n - \lceil k/2 \rceil)(n - \lfloor k/2 \rfloor) \quad \text{for ARCE.}$$

We can also compute the reliability improvement factor
(RIF), defined as $RIF(t) = (1 - R_0(t))/(1 - R(t))$, where
$R_0(t)$ is the reliability of a nonreconfigurable array, and
thus, $R_0(t) = Pr_0(t)$.
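
The measures defined in this section can be evaluated with a short Python sketch for the SRE case. Several ingredients are assumptions made for illustration: the rates are taken as $\lambda_k = C_k\lambda$ with $C_k = n(n - k)$, the state probabilities use the standard closed form for a chain with distinct rates, and the performance of state $k$ is taken as the worst case value $1/\lceil n/(n - k) \rceil$. The function names are hypothetical, and the closed form may lose accuracy through cancellation when there are many states.

```python
from math import exp, prod


def state_probabilities(rates, c, t):
    """Closed-form Pr_k(t) for distinct rates, coverage c, and Pr_0(0) = 1."""
    probs = []
    for k in range(len(rates)):
        total = 0.0
        for j in range(k + 1):
            denom = prod(rates[i] - rates[j] for i in range(k + 1) if i != j)
            total += exp(-rates[j] * t) / denom
        probs.append(c ** k * prod(rates[:k]) * total)
    return probs


def sre_measures(n, c, t, level, lam=1.0):
    """Reliability, performability at `level`, availability, and RIF for SRE."""
    capacity = [n * (n - k) for k in range(n)]            # C_k
    rates = [ck * lam for ck in capacity]                 # assumed lambda_k
    probs = state_probabilities(rates, c, t)
    perf = [1.0 / -(-n // (n - k)) for k in range(n)]     # 1 / ceil(n / (n - k))
    reliability = sum(probs)
    performability = sum(p for p, pk in zip(probs, perf) if pk >= level)
    availability = sum(p * ck for p, ck in zip(probs, capacity))
    rif = (1.0 - probs[0]) / (1.0 - reliability)
    return reliability, performability, availability, rif


for t in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(t, sre_measures(5, c=1.0, t=t, level=0.5))
```

The loop mirrors the evaluation over five intervals of 0.1 time units; the values it prints are produced by this sketch and are not taken from the paper's tables.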

We used the above expressions to evaluate the following
arrays using SRE and ARCE: a (5 × 5) array for the cases
when ($c = 1$, $B = 0.5$), ($c = 1$, $B = 0.25$), and a (10 × 10)
array for the cases when ($c = 1$, $B = 0.5$), ($c = 1$,
$B = 0.25$), ($c = 0.95$, $B = 0.5$), ($c = 0.98$, $B = 0.5$), and
($c = 0.99$, $B = 0.5$). We considered the operation of the
array during five intervals of 0.1 units of time starting at
$t = 0$. The results summarized in Tables I and II allow us to
conclude the following.

• ARCE has better reliability than SRE when fault coverage
$c = 1$ (see Table I).

• ARCE and SRE have comparable reliability when fault
coverage $c \le 0.99$ (see Table II).

• For high performance (i.e., large values of $B$), SRE has
better performability than ARCE (see Table I).

• ARCE has better computational availability (see
Table I).

• Reliability for both SRE and ARCE degrades significantly
as coverage decreases (see Table II); this is particularly
drastic for ARCE.

Table  III  compares  SRE  and  ARCE  qualitatively  and 
the  following  comments  complement  the  information  of 
that  table. 

• The amount of additional hardware can be measured in
terms of the increase in the number of a given type of component
of the array or in terms of the chip area taken by that
hardware. The first approach may be unrealistic for VLSI
arrays, whereas the second is dependent on the technological
process and the size of each cell used. We used the first
approach to account for the additional number of
interconnections and the second to measure the effect of adding
switches. We assumed that every switch takes 20 percent of
the area of a cell (like in [20]).

• Fault coverage is a critical parameter for the reliability of
ARCE and SRE. Two important conclusions can be made.
First, if fault coverage is not very good, then the difference
in reliability for SRE and ARCE is negligible. This, in turn,
means that SRE is preferred to ARCE because it requires less
additional hardware. The second conclusion is that the fault
detection and recovery schemes used should be as simple and
reliable as possible so that fault coverage is very close to
unity. The fact that our reconfiguration schemes use only a
reduced number of interconnections, switches (which can be
designed conservatively), and a very simple control rule
matches this need. This may not be true or feasible in fully
reconfigurable arrays.

TABLE I
Reliability, Performability for Performance Levels of B = 0.5 and
B = 0.25, and Computational Availability for SRE and ARCE in
(5 × 5) and (10 × 10) Arrays (Coverage c = 1)

TABLE II
Reliability Improvement Factor for SRE and ARCE in a (10 × 10)
Array for Coverage c = 1, c = 0.99, c = 0.98, and c = 0.95

TABLE III
Comparison of SRE and ARCE

• SRE has high performability for high performance
levels, whereas ARCE has high performability for low
performance levels. This suggests that SRE should be used
in real-time applications where execution time is critical,
whereas ARCE is adequate for applications requiring long
periods of operation (e.g., remote systems).

• Computational availability is used here to measure the
potential computational capacity of the array. The actual
computational capacity depends also on the method used to
explore the potential capacity. In other words, there may be
other algorithm partitioning techniques or computation-to-processor
allocation policies that yield better performability
than our techniques. In particular, note that (potential)
computational availability is always better for ARCE in contrast
to worse performability for high performance levels.

VII.  Conclusions

Any processor array can be made fault tolerant by using the
approach described in this paper. However, our reconfiguration
schemes assume the existence of complementary
testing and fault recovery techniques and adequate
technologically feasible hardware implementations. Due to
limitations in space we do not elaborate on these issues here, and
for a brief discussion on some of them the reader is referred
to [31].

In summary, this paper proposed and analyzed two possible
approaches to the design of gracefully degradable array
processors. They thrive on the generality and simplicity of
reconfiguration schemes which make it possible to preserve
the conformability of the processor array and the algorithm
being executed. We showed that any algorithm can be
(re)mapped into a processor array so that it can be partitioned
or reconfigured along one or both orthogonal directions of the
plane. Array reconfiguration is achieved by logical elimination
of rows and/or columns with faulty processors. The
switching mechanism isolates failed modules and is extremely
simple and cost effective. We analyzed and exemplified
the use of our techniques in detailed examples. Closed
form expressions were derived for reliability, performability,
and computational availability, and they were used to evaluate
(5 × 5) and (10 × 10) array systems. Besides its simplicity
and generality, our approach has another significant
advantage over previously proposed solutions, i.e., the
possibility of graceful degradation. Our schemes tolerate at least
$(n - 1)$ faults in an (n × n) array, whereas redundancy solutions
tolerate a small constant number of faults and require
larger amounts of additional hardware. The novelty and
superiority of our schemes result from the fact that they explore
the characteristics of both the algorithm and the architecture.
Clearly, our approach can be used together with other solutions
based on the use of redundancy or more complex forms
of array reconfiguration. In these hybrid schemes, redundancy
could be used to preserve the size and structure of the
array as long as possible, followed by progressive simple and
fast SRE or ARCE reconfiguration steps interspersed with
other complex time-consuming reconfiguration procedures.
I.  Introduction

This  paper  describes  the  design  of  a  high-level  GaAs 
systolic  architecture  intended  for  use  in  a  new  generation 
of  advanced  communication  and  radar  systems.  The  pur¬ 
pose  of  the  systolic  array  is  to  provide  those  systems  with 
a  real  time  adaptive  filter  signal  processing  capability 
with  very  large  throughput  and  short  response  time.  The 
high  performance  characteristics  of  the  architecture 
described  here  result  from  the  use  of  fast  GaAs  technol¬ 
ogy,  an  efficient  and  numerically  stable  algorithm,  and  an 
innovative  systolic  array  architecture.  The  following  sec¬ 
tions  of  this  paper  describe  these  design  choices  and  how 
they  interplay  and  converge  into  a  solution  which  meets 
the  stringent  requirements  of  communication  and  radar 
systems of the 1990's.

Section II summarizes the basic advantages and
disadvantages of GaAs. Section III explains in more
detail the adaptive filter signal processing problem as it
occurs in the targeted applications and describes the
algorithm used for its solution. The global systolic architecture
is described in Section IV, as well as the design of the
individual processor elements of the array. Section V is
dedicated to considerations on fault-tolerance, modularity
and extensibility, and architectural impact of GaAs
technology. Section VI is dedicated to conclusions.

II.  Why  GaAs?

Gallium Arsenide (GaAs) technology has recently
shown rapid increases in maturity [1]. In particular, the
advances made in digital chip complexity have been
enormous. This progress is especially evident in two types of
chips, static RAMs and gate arrays. In 1983, static
RAMs containing 1K bits were announced. One year
later 4K-bit versions were presented. Several companies
were working on 8K-bit designs in 1985. Gate arrays
have advanced from a 1000-gate design presented in 1981
to a 2000-gate design announced in 1985. With this
enormous progress underway, it is now appropriate to
consider the use of this new technology in the
implementation of high-performance systolic arrays.

GaAs  technology  generates  high  levels  of  enthusiasm 
primarily  because  of  two  advantages  it  enjoys  over  Sili¬ 
con.  These  are  higher  speed  and  greater  resistance  to 
adverse  environmental  conditions. 

GaAs gates switch faster than Silicon bipolar
Transistor-Transistor Logic (TTL) gates by at least an
order of magnitude [2]. These switching speeds are even
faster than those attained by the faster Silicons, CMOS
and bipolar ECL, but at lower power levels [2], [3]. For
this reason, GaAs is seen to have applications in
computer designs in several computationally-intensive areas.
In fact, it has been reported that the Cray-3 will contain
GaAs parts.

GaAs also enjoys greater resistance to radiation and
temperature variations than does Silicon. GaAs successfully
operates in radiation levels of 10 to 100 million
RADs [2]. Its operating temperature range extends from
-200 to 200 degrees centigrade [2]. Consequently, GaAs
has created great excitement in the military and
aerospace markets.

Unfortunately, GaAs is also characterized by some
undesirable properties. Two significant areas where GaAs
is inferior to Silicon are cost and transistor count
capability.

The higher cost of GaAs chips is largely the result of
the higher cost of GaAs material itself and the lower
yield of GaAs chips. GaAs material is more expensive
than Silicon. Also, since GaAs is a compound material,
additional processing is required to create it and to verify
its composition. The lower GaAs yield is also due to
multiple influences. First, although improvements are
being made in this area, GaAs is characterized by a
higher density of dislocations than Silicon. Second, in
order to achieve working devices with adequate noise
margins, very fine control of circuit parameters is
required, and this is not yet easily achieved [2]. Finally,
the high brittleness of GaAs contributes to its high cost
due to its increased breakage [4]. Currently, GaAs chips
are roughly two orders of magnitude more expensive than
their Silicon counterparts; however, this difference should
narrow to possibly one order of magnitude or less by the
end of this decade.

Transistor  count  limitations  of  GaAs  are  attributed 
to  both  yield  and  power  considerations.  The  relatively 
low  yield  of  GaAs  chips  forces  designers  to  consider  chips 
with  smaller  area  (therefore  lower  transistor  count)  in 
order  to  remain  cost-effective.  Although  GaAs  gates 
require  less  power  than  their  Silicon  counterparts  when 
operating  at  similar  speeds,  GaAs  gates  do  consume  con¬ 
siderably  more  power  than  slower  Silicon  MOS  gates. 
Because  of  the  thermal  management  problem  this  creates, 
fast  GaAs  chips  cannot  match  the  transistor  count  poten¬ 
tial  of  Silicon  chips. 

It is believed that these four GaAs-Silicon differences
are not of a temporary nature, but instead result from
inherent differences between GaAs and Silicon materials.
Conclusions which are based on these four fundamental
characteristics will remain valid even as GaAs technology
matures.

Because  of  these  GaAs-Silicon  differences,  it  is  not 
sufficient  to  merely  copy  existing  Silicon  designs  into 
GaAs  in  order  to  obtain  optimal  GaAs  performance.  The 
GaAs environment presents the computer architecture
designer  a  new  set  of  challenges.  However,  the  rewards  of 
successfully  exploiting  this  new  environment  are  substan¬ 
tial.  With  the  high  speeds  which  characterize  GaAs  and 
the  recent  examples  of  GaAs  chips  with  VLSI  levels  of 
integration  (>10,000  transistors),  we  are  presently  on  the 
verge  of  achieving,  with  a  single-chip  processor,  speeds  for 
scalar  operations  typical  of  present-day  supercomputers. 


III.  Applications  and  Algorithms  for  Adaptive
Filter  Signal  Processing

Two similar adaptive filter signal processing applications
exist for the proposed systolic array processor:
adaptive antenna array beamforming and adaptive
doppler/spectral filtering. In an adaptive antenna array,
the phase and amplitude of the waveform incident upon
each antenna element within the receiving aperture are
adjusted to control properties of the far field antenna
pattern such as maximum gain, low sidelobe levels, narrow
mainbeam and pattern nulls in the angular direction
of interfering signals. In adaptive doppler/spectral filtering,
the phase and amplitude of each waveform sample in
the time domain are adjusted to control properties of a
filter in the frequency domain such as maximum gain at
the desired frequency, low filter sidelobe levels, narrow
main filter response and sidelobe nulls at the frequencies
of interfering signals. In both cases, the amplitude and
phase adjustments (i.e., complex weights) are determined
by processing all voltage samples in real time. In either
case, an adaptive Discrete Fourier Transform filter
operating on a number, N, of complex voltage samples
may be used. The N input samples, frequently multiplied
by a window weighting function to control filter sidelobe
levels in the transform plane, form a complex N-dimensional
vector, x. The filter output is formed from
the product of the weight vector, w, and the signal vector,
x. The optimum weight vector is

$$w = R^{-1}s^{*} = M^{-1}v^{*}$$

where s is the N-dimensional steering vector defining the
antenna direction or doppler frequency peak response and
$R = x^{*}x^{T}$ is the N by N covariance matrix of the signal,
whose ij-th component is $r_{ij} = x_i^{*}x_j$. M and v are
rescaled versions of R and s, respectively (for full details,
the reader is referred to [5]).

The adaptation process requires the inversion of an
N×N complex matrix in real time or, equivalently, the
solution of a set of simultaneous linear equations. This
problem has been one of the main concerns in numerical
analysis and control theory for many years. A number of
algorithms have been developed and studied with the
adaptive array application in mind [S-fl]. However, for
the size of the future systems for both communications
and radar applications, the complexity of solving the
associated equations grows rapidly, implying the need for
utilizing only the most efficient algorithms. Tradeoffs
between hardware complexity and convergence time,
maximization of signal-to-noise ratio and minimization of
the effect of error sources on the adaptive process must
be seriously considered for each application. After
analysis and simulation of the candidate algorithms, one
of the direct Matrix Square Root (MSR) algorithms
proposed in [6] was selected as the most adequate.

The MSR algorithms involve directly updating the
sample matrix square root factors U, D that evolve from
a Cholesky factorization of the positive semi-definite M

$$M = UDU^{*T}$$

where U is a lower triangular matrix with unit diagonal
elements and D is a diagonal matrix with positive or zero
diagonal elements. U and D are defined as

$$M_K = U_K D_K U_K^{*T} = \sum_{i=1}^{K} x_i^{*} x_i^{T}$$

and are recursively updated as

$$U_K D_K U_K^{*T} = U_{K-1} D_{K-1} U_{K-1}^{*T} + x_K^{*}\, b\, x_K^{T}$$

where b is a scalar set to one initially. The matrix inversion
needed to solve for the optimal weights, $w_0$, can be
reduced to a single back-substitution. Distinct MSR
algorithms differ only in the values used for the diagonal
elements $u_{ii}$ and/or $d_i$. The chosen MSR algorithm
results from letting $u_{ii} = 1$.
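
To make the factorization concrete, the Python sketch below builds a sample matrix from a few complex data vectors and factors it as $M = UDU^{*T}$ with unit-diagonal lower triangular U. It is a batch factorization written for illustration and is not the recursive MSR update of [6]; it also assumes M is positive definite (enough independent samples), and the function names and sample values are hypothetical.

```python
def sample_matrix(samples):
    """M[i][j] = sum over samples of conj(x[i]) * x[j]."""
    n = len(samples[0])
    return [[sum(x[i].conjugate() * x[j] for x in samples) for j in range(n)]
            for i in range(n)]


def udu_factor(m):
    """Return (U, d) with M = U diag(d) U^{*T}, U unit lower triangular."""
    n = len(m)
    u = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Diagonal entry: real for a Hermitian M.
        d[j] = (m[j][j] - sum(abs(u[j][k]) ** 2 * d[k] for k in range(j))).real
        for i in range(j + 1, n):
            acc = m[i][j] - sum(u[i][k] * u[j][k].conjugate() * d[k]
                                for k in range(j))
            u[i][j] = acc / d[j]
    return u, d


def rebuild(u, d):
    """Recompute U diag(d) U^{*T} to check the factorization."""
    n = len(d)
    return [[sum(u[i][k] * d[k] * u[j][k].conjugate() for k in range(n))
             for j in range(n)] for i in range(n)]


samples = [[1 + 1j, 2, -1j], [0.5, 1j, 1], [2 - 1j, 1, 1 + 2j], [1, 1, 1]]
m = sample_matrix(samples)
u, d = udu_factor(m)
error = max(abs(a - b) for row_a, row_b in zip(m, rebuild(u, d))
            for a, b in zip(row_a, row_b))
print(d)
print(error)   # should be at rounding-error level
```

A recursive version would instead update U and D after each new sample vector, which is what makes the method suitable for real-time operation.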

IV.  The  Basic  Systolic  Architecture

The algorithm lends itself to a systolic array realization.
The triangular structure of the array reflects the
matrix triangularization step, characteristic of the
algorithm. The array consists of a triangular grid with
N(N-1)/2 nodes or elements (N being the size of the original
matrix). With regard to the computation involved at
each of them, the elements are of two types: the
elements along the diagonal and the others.

Three  Systolic  Waves 

Assuming initially a fully systolic realization, the
algorithm iteration requires three systolic waves of computation
(they are shown for the simple case of N=6 in
Figures 1 through 3):

1) Wave 1, covariance matrix updating, starts
from the top-left element and propagates
toward the right and the bottom; the data
vector X enters the array from the top,

2) Wave 2, first step of the back substitution,
propagates, as wave 1, from the top-left element
and generates an intermediate vector z
used next to compute the weights w,

3) Wave 3, second step of the back substitution,
propagates backward from the bottom
element of the array, and generates the
weights sequentially.
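The two substitution sweeps performed by waves 2 and 3 can be sketched sequentially as follows. This is only an illustration under stated assumptions: the right-hand side `s` (for example a steering vector) is not named in the text and is hypothetical, as are the function name `solve_weights` and the small test matrices; all diagonal entries of D are assumed nonzero.

```python
from fractions import Fraction


def solve_weights(U, d, s):
    """Solve (U diag(d) U^T) w = s in two sweeps.

    U is unit lower triangular (list of rows), d holds the diagonal of D.
    Returns the intermediate vector z and the weights w.
    """
    n = len(d)
    # First step (forward sweep, as in wave 2): solve U z = s.
    z = []
    for i in range(n):
        z.append(Fraction(s[i]) - sum(U[i][k] * z[k] for k in range(i)))
    # Second step (backward sweep, as in wave 3): solve U^T w = D^{-1} z,
    # producing the weights one at a time starting from the last one.
    w = [Fraction(0)] * n
    for i in reversed(range(n)):
        w[i] = z[i] / d[i] - sum(U[k][i] * w[k] for k in range(i + 1, n))
    return z, w


# Hypothetical 2x2 case: M = U D U^T = [[1, 2], [2, 7]].
z, w = solve_weights([[1, 0], [2, 1]], [1, 3], [1, 0])
```

Note that the backward sweep needs the same U values that were used in the forward sweep, which is why the array must retain old covariance values while wave 3 runs.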


One notes that waves 1 and 2 can be run simultaneously,
but wave 3 must wait for the completion of the previous
two to start backward. Not only that, wave 3 has to
operate on old covariance values; as an example, when
wave 3 reaches the last row at the top, it expects to find
the U values that were there when the data vector X, for
which the weights are now being computed, first entered
the array. This requires memory within the array.

Given the array fully populated with processing elements
and necessary memory, each algorithm iteration
will perform two passes through the array (Figure 1):

- The first pass (including waves 1 and 2)
proceeds from the top and left, down and right
at a 45 degree angle. The data vector is fed in
parallel and at a 45 degree angle into the array,
i.e., each data element in a same vector enters
the array one cycle after the input of the previous
data element in the same vector. The process
then proceeds at an array clock interval
determined by the longest computation time in
any array element, which, in this case, happens
to be that of the diagonal element. Concurrently,
the updated covariance matrix U
values are stored into each cell, as in a shift
register.

- The second pass (generation of the weights w)
can start at the same time the last diagonal element
D is processed (same clock); now the computation
proceeds backward also at a 45 degree
angle starting from the last column and last row
on the right. At each clock a new weight is
computed. Note in Figure 4 the uneven but
predictable length of the "shift register" in each
array element.

From Figure 4, one can assess easily not only the
memory requirements but also the latency time (of the
order of 2N) between the time a new data vector enters
the array and the time the last of the weights w is
released.

Figure 1 also shows the organization of the PEs
within the systolic processor. The processor elements are
arranged as a right triangle. Calculations within the triangle
ripple from top to bottom (root covariance update
and first step of back substitution) and from right to left
(second step of back substitution). Data flows
horizontally, vertically, and downward along the outside
diagonal. One important feature of the array is that the
autocovariance values remain stationary within the array,
so that no busses are required to transmit them to other
parts of the array.

The array is constructed of two cell types, which are
designated SAA-1 and SAA-2. The SAA-1 cells perform
calculations needed by the root covariance update and
the first step of the back substitution.

The SAA-2 cells are involved in both the root covariance
updates and both of the steps of back substitution.
The root covariance values are kept within the SAA-2
cells. For purposes of pipelining the second step of the
back substitution, each SAA-2 cell contains a FIFO register
which delays these U-values for the necessary cycles.


V. The Fault-Tolerant, Expandable, GaAs Systolic Array

Fault-tolerance is achieved by periodically testing
the array and dynamically reconfiguring it when a fault is
detected. To avoid performance degradation, spare
columns and rows are provided to allow for the logical
removal of faulty processing elements. If a cell in row i
fails then both row i and column i are bypassed and logically
replaced by the neighbor column and row, respectively.
Figure 5 shows the basic array augmented with
spare rows and columns and Figure 6 illustrates the
reconfiguration of a (128x128) triangular array with an
extra row and one extra column and one faulty processor.
In the worst case fault distribution (i.e., all faults occur in
different rows and columns), up to K faults can be
tolerated if K spare columns and K spare rows are provided.
To tolerate K worst-case faults in a system for N
degrees of freedom the percentage of additional hardware
required is

$$100 \times \frac{K(2N + K + 1)}{N(N-1)} \ \%.$$

For example, for N = 12 and K = 1, 20% redundancy is required.
On-cell multiplexers set by the array control system are used to
bypass rows and/or columns of the systolic array. Fault
detection is done by interleaving test vectors with the
input data and checking the output generated by the
array for the test inputs against the expected results.
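As a quick numerical check of the redundancy expression, the following sketch evaluates it for the example quoted in the text. The function name `spare_overhead_percent` is hypothetical; the formula is the one stated above.

```python
def spare_overhead_percent(n, k):
    """Percentage of additional hardware needed to tolerate k worst-case
    faults in a system with n degrees of freedom: 100*k*(2n+k+1)/(n*(n-1))."""
    return 100.0 * k * (2 * n + k + 1) / (n * (n - 1))


# The example from the text: N = 12, K = 1.
overhead = spare_overhead_percent(12, 1)
print(round(overhead, 1))
```

For N = 12 and K = 1 the expression evaluates to 2600/132, which rounds to the 20% quoted in the text.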

The occurrence of a fault after all spare rows and
columns have been used does not have to cause the crash
of the system. Graceful degradation is possible in two
ways: (a) by reducing throughput and (b) by eliminating
degrees of freedom. If the processor in row i and column j
fails then (a) row i can be bypassed and a neighbor
row is time-multiplexed to replace row i or (b) row i and
column j can be bypassed and the corresponding degree
of freedom ignored. Notice that a reduction in
throughput may require the system to ignore some samples
but the degradation affects all weights instead of
simply eliminating one.

An alternative to the use of test vectors and complex
diagnostics for the detection and location of faults consists
of using actual receiver sampled data and time
redundancy. The basic idea is to multiplex the systolic
array in time so that the same input samples are processed
twice. However, for the second processing cycle,
the samples are circularly shifted so that column i of the
array receives the same data received by column i-1 in
the first processing time (column 1 would receive the
same data received by column N in the previous step).
Internally, each processor can then compare its result
with the result computed by the neighbor processors for
the same data.
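The time-redundancy idea can be illustrated with a deliberately simplified model. Everything in this sketch is an assumption made for illustration: each column is reduced to a single function `cell` applied to one sample, a fault is modeled as a column that always adds an error to its output, and the names `locate_faulty_column`, `cell` and `faulty` are hypothetical. Columns are numbered from 0, so column 0 receives in the second pass the sample that the last column received in the first pass.

```python
def locate_faulty_column(samples, cell, faulty=None):
    """Process the samples twice, the second time circularly shifted by one
    column, and compare each column's second result with its left
    neighbor's first result for the same data."""
    n = len(samples)

    def run(col, value):
        out = cell(value)
        return out + 1 if col == faulty else out  # injected, modeled fault

    first = [run(i, samples[i]) for i in range(n)]
    # samples[i - 1] wraps around for i = 0 (Python negative indexing).
    second = [run(i, samples[i - 1]) for i in range(n)]
    mismatches = {i for i in range(n) if second[i] != first[i - 1]}
    # A single faulty column c disagrees with its left neighbor (comparison
    # at c) and causes its right neighbor to disagree (comparison at c + 1).
    return sorted(i for i in mismatches if (i + 1) % n in mismatches)


square = lambda v: v * v
print(locate_faulty_column([3, 1, 4, 1, 5, 9], square, faulty=2))
```

In this toy model a single fault produces two adjacent mismatches, and the column common to both comparisons is the suspect; a real array would compare intermediate results inside each processor, as the text describes.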

Figure 7 shows a (6x6) square array module as an
extension of the (6x6) basic triangular array. Note that:
(1) the diagonal elements are capable of performing as
SAA-1 or SAA-2 cells and (2) the upper triangle of SAA-2
elements in the basic array is replicated below the diagonal
of the square array module. The basic idea underlying
the extensibility and universality of this square array
module is illustrated in Figure 5. This figure shows how
a large triangular array can be generated by replicating
the square array module. The replication can be done in
time, i.e., by time multiplexing the square array so that it
emulates the large triangular array. The replication can
also be done in space, i.e., several identical modules are
simply tiled together until the large triangle is covered.
Partial space and partial time replication is also possible.
Thus, the square array module can serve as the building
block for systems with different customer requirements
and intended for different applications. These basic ideas
are similar to those discussed in [7].


VI.  Conclusions 

We described the design of a high level systolic
architecture for adaptive signal processing in high performance
advanced communication and radar systems. The
main characteristics are extremely high throughput, fast
response time and high reliability as a result of marrying
advanced GaAs technology, a sophisticated algorithm and
innovative concepts in computer architecture and fault-tolerance.
Figure 1. Systolic array operation (3-dimensional
representation).

Figure 5. (128x128) basic array augmented with 16 spare
rows and 16 spare columns.

Figure 7. A (6x6) square array module as a result of extending
a (6x6) triangular basic array by (a)
designing the diagonal elements so that they
can behave as either SAA-1 or SAA-2 cells and
(b) adding a lower triangle of SAA-2 cells.

Figure 6. Reconfigured (128x128) array with one spare
row and one spare column and a faulty processor.
3.6 ALGORITHM RECONFIGURATION
TECHNIQUES FOR GRACEFULLY DEGRADABLE
PROCESSOR ARRAYS


Jose  Fortes 


INTRODUCTION 

Operational fault-tolerance in VLSI/WSI processor arrays remains an important obstacle to the
widespread use of such architectures. In particular, graceful degradation is hard to achieve, thus
implying the need for large amounts of redundancy. Without graceful degradation, after redundancy
is exhausted, any additional fault causes the entire system to fail, an unacceptable fact for the very
large processor arrays made possible by VLSI/WSI. Some solutions have been proposed for this
problem. In these relatively few fault tolerance schemes, graceful degradation is achieved at the cost
of large losses in throughput or response time, costly additional interconnect, complex switching
mechanisms and/or involved control schemes.

A promising approach to this problem relies on using simple on-line algorithm reconfiguration
techniques together with simple hardware reconfiguration mechanisms. In essence, algorithms are
reconfigured so that they can execute on the same processor array after the occurrence of faults and
possible removal of processing elements.

In the space allowed, this paper shows how rational quasi-affine (RQA) algorithm transformations
can be used to devise such reconfiguration schemes. It describes the mathematical framework
underlying our techniques, discusses examples and three approaches based on a common RQA
transformation which yields optimal graceful degradation and briefly discusses extensions of our
approach to 2-dimensional arrays.


MATHEMATICAL  FRAMEWORK 

In the following discussion we use $Z$ and $I$ to denote the set of integers and the set of nonnegative
integers, and use $Z^n$ and $I^n$ to refer to their corresponding nth Cartesian powers. We will consider
only q-dimensional processor arrays, where q = 1, 2. We see a processor array as a finite
q-dimensional grid in which each integer point is a vector index of a processor and a set of vectors (the
interconnection primitives) which describes the regular pattern of interconnections of the array. The
following definition formalizes this view.

Definition 1 - A processor array is a tuple $(L^q, P)$ where q is the dimension of the array, $L^q \subset Z^q$ is
the index set and $P \in Z^{q \times r}$ is a matrix of r interconnection primitives.

Thus, in a processor array $(L^q, P)$, the processor with index $\ell \in L^q$ is connected to a processor
with index $\ell' = \ell + p$, $p \in P$, if $\ell' \in L^q$, and it is connected to an input-output port otherwise.
Example 1: The linear processor array shown in figure 1(a) can be described by $(L^1, P)$ where
$L^1 = \{\ell : 0 \le \ell \le 3\}$ and $P = [0 \;\; 1 \;\; {-1}]$. The square processor array of figure 1(b) can be described
by $(L^2, P)$ where $L^2 = \{(\ell_1, \ell_2)^T : 0 \le \ell_1, \ell_2 \le 3\}$ and

$$P = \begin{bmatrix} 0 & 1 & 0 & -1 & 0 \\ 0 & 0 & 1 & 0 & -1 \end{bmatrix}$$

Figure 1. (a) A linear processor array; (b) an orthogonal processor array.
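Definition 1 and Example 1 can be made concrete with a small sketch that lists, for a given processor, which primitives lead to another processor and which lead to an input-output port. The function name `connections` and the choice to skip the all-zero primitive (interpreted here as data staying in place) are assumptions of this sketch, not part of the definition.

```python
def connections(index_set, primitives, cell):
    """Return (neighbor indices, number of I/O ports) of one processor.

    index_set: set of index tuples; primitives: list of column vectors of P.
    """
    neighbors, ports = [], 0
    for p in primitives:
        if all(c == 0 for c in p):
            continue  # assumption: the zero primitive means "stay in place"
        target = tuple(a + b for a, b in zip(cell, p))
        if target in index_set:
            neighbors.append(target)
        else:
            ports += 1  # the link leaves the array: input-output port
    return neighbors, ports


# Linear array of Example 1: L = {0, ..., 3}, P = [0 1 -1].
linear = {(l,) for l in range(4)}
p_linear = [(0,), (1,), (-1,)]

# Square array of Example 1: columns of the 2 x 5 matrix P.
square = {(l1, l2) for l1 in range(4) for l2 in range(4)}
p_square = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]

print(connections(linear, p_linear, (0,)))
print(connections(square, p_square, (0, 0)))
```

Boundary processors are the ones for which the port count is nonzero; interior processors of the square array have four neighbors and no ports.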

The execution of an algorithm on a given array can be thought of as an ordered set of instantiations
of the array, each of which contains an assignment of computations to processors at a particular
time of execution. Consequently, we see an array algorithm as a (q+1)-dimensional grid in
which (a) for q = 1, each integer point $(j_1, j_2)^T$ indexes a computation at time $j_1$ and processor $j_2$ and
(b) for q = 2, each integer point $(j_1, j_2, j_3)^T$ indexes a computation at time $j_1$ and processor $(j_2, j_3)^T$. In
addition, we associate with each array algorithm a set of vectors (called dependence vectors) which
describes the pattern of generation and use of data in time and space. In other words, if a computation
with index $j$ generates a value used in the computation with index $j'$, then $j' - j$ is a dependence
vector. Clearly, the first entry of any dependence vector must be $\ge 1$ (i.e., at least one unit of time
separates generation and use of a variable), and the vector corresponding to the other entries must
correspond to a linear combination of interconnection primitives (i.e., a path connecting the processors
where the variable is generated and used). Assuming that communication (over a single interconnection
primitive) and execution of a computation take one unit of time, the number of interconnection
primitives used to communicate a result from the computation with index $j$ to the computation with
index $j'$ must also be less than or equal to the first entry of the dependence vector $j' - j$ (i.e., the
interval of time between the computations). These considerations translate into the following
definition of array algorithm.


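Definition 2 - Consider an array $(L^q, P)$, $P \in Z^{q \times r}$. An array algorithm is a tuple $(J^{q+1}, D)$ where
$J^{q+1} \subset Z^{q+1}$ is the index set of the algorithm, and $D \in Z^{(q+1) \times m}$ is a dependency matrix of m dependence
vectors such that

$$d_{1i} \ge 1, \quad i = 1, \ldots, m \qquad (1)$$

$$[d_{ji}]_{j = 2, \ldots, q+1;\ i = 1, \ldots, m} = PK \ \text{ for } K \in I^{r \times m} \text{ such that } \sum_{j} k_{ji} \le d_{1i}, \quad i = 1, \ldots, m \qquad (2)$$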

In this definition of array algorithm we represent only the structure of the algorithm and
abstract from the actual computations being performed. This is adequate because we are essentially
concerned with problems of matching computational structures. Also, input and output data are not
explicitly represented because they can be treated as generated data (i.e., for a given processor
receiving data from other processors there is no distinction between data generated and data
"passed" by those processors). Finally, the description of dependences would be more precise if to a
given dependence vector we associated the index point where the dependence is valid. This complication
turns out to be unnecessary for the derivation of our main results.
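Conditions (1) and (2) of Definition 2 can be checked mechanically once a candidate matrix K is supplied. The sketch below does this for a linear array with $P = [0 \; 1 \; {-1}]$; the helper names `matmul` and `is_array_algorithm` are hypothetical, and the sample matrices are chosen only for illustration.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def is_array_algorithm(D, P, K):
    """Check conditions (1) and (2) for dependency matrix D, primitives P
    and a candidate nonnegative integer matrix K."""
    m = len(D[0])
    cond1 = all(D[0][i] >= 1 for i in range(m))
    nonneg = all(k >= 0 for row in K for k in row)
    spatial = matmul(P, K) == D[1:]
    path_len = all(sum(row[i] for row in K) <= D[0][i] for i in range(m))
    return cond1 and nonneg and spatial and path_len


P = [[0, 1, -1]]                 # linear array of Example 1
D = [[1, 1, 1],
     [0, 1, -1]]                 # stay, move right, move left in one step
K = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
print(is_array_algorithm(D, P, K))

# Two hops to the right in a single time step violate condition (2).
print(is_array_algorithm([[1], [2]], P, [[0], [2], [0]]))
```

The second call fails because two interconnection primitives would have to be traversed within one unit of time.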


Example 2: Consider the linear systolic array shown in figure 2 for convolution computation (proposed
in [Li & Wah 85]). It computes the recurrence


*I4-Ir-| 


ak  -  a^ 


(3) 


for m = 3 and n = 5. The algorithm executes in 9 units of time and some processors are idle only during
the initial and final phases of the computation. Variables "a" stay always in the same processor
and variables "y" and "x" move to the right neighbor processor every one and two units of time,
respectively  This  array  algorithm  can  be  described  by  where  J*  =  ^  ^  h  ^  3. 

ia  $  ii  £  ^  ij}'  •  ^  •  Ji  ^  ^“<1  ia  processor  index,  and 

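$$D = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 1 \end{bmatrix} \qquad (4)$$

with columns corresponding to the variables (a), (x) and (y), respectively, and where the second row
corresponds to

$$PK = \begin{bmatrix} 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$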

Figure 2. Systolic array for convolution (n = 5, m = 3)

Since we are interested in general purpose algorithm reconfiguration schemes, we will consider
"worst case" array algorithms. In other words, we will consider algorithms which, at any time during
execution, use all processors and all interconnection links of the array. Thus, for a linear array
algorithm which takes T units of time to execute on a linear array with N processors we have
$(J^2, D)$ where

$$J^2 = \{ (j_1, j_2)^T : 0 \le j_1 \le T-1,\ 0 \le j_2 \le N-1 \}$$

and

$$D = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & -1 \end{bmatrix}$$

Figure 3(a) illustrates this for T = 4, N = 5. Similarly, for a square orthogonal array algorithm with
execution time T and (N×N) processors we have $(J^3, D)$ where

$$J^3 = \{ (j_1, j_2, j_3)^T : 0 \le j_1 \le T-1,\ 0 \le j_2, j_3 \le N-1 \}$$

and

$$D = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & -1 \\ 0 & 1 & 0 & -1 & 0 \end{bmatrix}$$

Hereon and unless otherwise stated, we use the term array algorithm to mean a worst case
array algorithm. We are interested in the case when, due to the failure of one processor, the original
array algorithm must be executed by a smaller processor array. This requires that the algorithm be
reconfigured, i.e., that operations initially allocated to faulty processors be remapped into the
operational processing elements. This is equivalent to saying that we need to obtain a new array
algorithm by transforming the original array algorithm. Algorithm transformations have been studied
extensively and are reported in [Fortes & Raghavendra 85] and [Moldovan & Fortes 86] and the
references thereof. In [Fortes & Raghavendra 85] it was shown that a simple transformation can be
used to reconfigure array algorithms with unidirectional data movements. It was also shown that any
array algorithm can be transformed into an equivalent array algorithm with unidirectional data
movements, thus making that scheme generally applicable. One of the disadvantages of this
approach is that the equivalent algorithm may be slower than the original one. Another disadvantage
is the requirement for "wrap-around" links between processors at the boundaries of the array.
To show the impact of this approach on the "degraded" performance of the array with faulty processors
we need to discuss this scheme in more detail. Consider the case of a linear array algorithm
which executes in T units of time on N processors. Assume that data movements are unidirectional
and one processor fails. The remaining operational processors have virtual indices ranging from 0 to
N-2. The reconfiguration is done by simply mapping a computation originally performed in processor
$j_2$ at time $j_1$ (i.e., point $(j_1, j_2)$) into processor $j_2 \bmod (N-1)$ at time $j_1 + \lfloor j_2/(N-1) \rfloor T$. This means that
response time is doubled and that the average throughput is halved.
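The cost of this wrap-around scheme can be seen in a few lines. The sketch below assumes the remapping reads "processor $j_2 \bmod (N-1)$ at time $j_1 + \lfloor j_2/(N-1) \rfloor T$" (the divisor in the source is partly illegible, so this reading is an assumption); the names `remap` and `reconfigured_schedule` are hypothetical.

```python
def remap(j1, j2, n, t):
    """Map computation (time j1, processor j2) of an algorithm that runs in
    t time units on n processors onto the n - 1 operational processors."""
    return j1 + (j2 // (n - 1)) * t, j2 % (n - 1)


def reconfigured_schedule(n, t):
    """Remap every index point of the worst-case linear array algorithm."""
    return {(j1, j2): remap(j1, j2, n, t)
            for j1 in range(t) for j2 in range(n)}


schedule = reconfigured_schedule(n=5, t=4)
last_time = max(time for time, _ in schedule.values())
busy_slots = len(set(schedule.values()))
print(last_time + 1, busy_slots)
```

With N = 5 and T = 4 the remapped algorithm needs 2T = 8 time units, but only 20 of the 32 available processor-time slots do useful work, which is the inefficiency the scheme proposed next avoids.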

The reconfiguration technique proposed in this paper does not suffer from the drawbacks discussed
above. It evolved from the theory of linear algorithm transformations ([Fortes & Raghavendra
85], [Moldovan & Fortes 86]) whose basic ideas are explained next. The reorganization of an
algorithm corresponds to a permutation of its index set and can be described as a linear transformation
$T \in Z^{(q+1) \times (q+1)}$, where T is nonsingular. The first row of T, denoted $\pi$, is referred to as the time
transformation and the remaining submatrix of T, denoted S, is called a space transformation. In
other words, T reorganizes the algorithm so that a computation with index $j$ in the original algorithm
is executed at time $\pi j$ and processor $Sj$ (i.e., the index in the transformed algorithm is $(\pi j, Sj)^T$). Due
to the linearity of the transformation, the dependence matrix of the transformed algorithm is simply
TD, where D is the dependency matrix of the original algorithm. Of course, T must be selected so
that the new algorithm is an array algorithm, i.e., (1) and (2) are satisfied. To illustrate this
approach the reader can verify that


$$T = \begin{bmatrix} -1 & 1 \\ 0 & 1 \end{bmatrix}$$

can be used to transform the algorithm described by (3), for which

$$D = \begin{bmatrix} -1 & -1 & 0 \\ 0 & 1 & 1 \end{bmatrix}$$

into the array algorithm of figure 2 for which the dependence matrix is (4), i.e., TD.
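The verification suggested to the reader amounts to one matrix product. The sketch below assumes the matrices read as $T = [[-1, 1], [0, 1]]$ and $D = [[-1, -1, 0], [0, 1, 1]]$ (the printed matrices are only partly legible, so these values are an assumption), and checks that the product has a first row of entries $\ge 1$, as condition (1) requires. The helper name `matmul` is hypothetical.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


T = [[-1, 1],
     [0, 1]]            # first row: time transformation; second row: space
D = [[-1, -1, 0],
     [0, 1, 1]]         # assumed dependences of the original recurrence

TD = matmul(T, D)
print(TD)
print(all(entry >= 1 for entry in TD[0]))   # condition (1)
```

Under these assumptions the product is [[1, 2, 1], [0, 1, 1]]: one variable stays in its processor and the other two move right after two and one time units, matching the description of the convolution array in Example 2.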


NEW  RECONFIGURATION  SCHEMES 

In this paper, we consider rational quasi-affine (RQA) transformations of the form $\lfloor Tj + t \rfloor$,
where $T \in Q^{(q+1) \times (q+1)}$, $t \in Q^{(q+1)}$ and $Q$ denotes the set of rational numbers. As for linear transformations,
T consists of a time transformation $\pi$ and space transformation S. Time transformations of
this type are discussed in [Fortes & Parisi 84] and a full discussion of RQA mappings will appear in a
forthcoming paper. Clearly, the class of RQA transformations is a superset of that of linear mappings
mentioned in the previous section. However, RQA transformations for which T is nonsingular
do not necessarily correspond to one-to-one mappings. Thus, before considering an RQA transformation,
one must show that it indeed specifies an injective mapping. In addition, conditions similar to
those used to select linear transformations must also be used to choose RQA mappings. As mentioned
before, a valid algorithm transformation must yield a new algorithm for which the dependency
matrix satisfies (1) and (2). For a linear transformation T, the new matrix is simply TD where D is
the dependency matrix of the original algorithm. For RQA transformations, this is not true. However,
it is still possible to define conditions which ensure that (1) and (2) are satisfied. Note that for
any X, Y and W the value of $\lfloor X/W \rfloor - \lfloor Y/W \rfloor$ is either $\lfloor (X-Y)/W \rfloor$ or $\lceil (X-Y)/W \rceil$. Hence, for the
transformation R mentioned above and any dependence $d = j - j'$, we have that the value of
$R(j) - R(j') \in [[Td]]$, where T is as defined above and the notation $[[\ ]]$ means that any entry in $Td$
can be replaced by either its ceiling or floor value. Thus, a valid RQA transformation must be such
that $[[TD]]$ satisfies (1) and (2).
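The floor-difference property used above is easy to confirm by brute force over a range of integers. The following sketch is only a numerical check, not a proof; the function name `floor_difference_options` and the tested ranges are arbitrary choices.

```python
import math
from fractions import Fraction


def floor_difference_options(x, y, w):
    """Return floor(x/w) - floor(y/w) together with the floor and the
    ceiling of (x - y)/w, for integers x, y and a positive integer w."""
    difference = x // w - y // w          # // is floor division in Python
    exact = Fraction(x - y, w)
    return difference, math.floor(exact), math.ceil(exact)


def property_holds(limit=20, max_w=7):
    """Check that the difference always equals the floor or the ceiling."""
    for w in range(1, max_w + 1):
        for x in range(-limit, limit + 1):
            for y in range(-limit, limit + 1):
                diff, lo, hi = floor_difference_options(x, y, w)
                if diff not in (lo, hi):
                    return False
    return True


print(floor_difference_options(5, 3, 4))
print(property_holds())
```

For example, with X = 5, Y = 3 and W = 4 the difference of floors is 1, which is the ceiling of 2/4, while the floor of 2/4 is 0.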

We start by discussing the case of linear array algorithms. Hereon, we will only consider RQA
transformations of the type introduced in the next theorem. The theorem shows that such transformations
are injective and, in addition, reversible in the integers. Afterwards we show that $[[TD]]$
satisfies (1) and (2). We use the symbol $\mathbb{R}$ to denote the set of real numbers and $\mathbb{R}^2 = \mathbb{R} \times \mathbb{R}$.


Theorem  1 

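Let $J^r \subset \mathbb{R}^2$ and $J = J^r \cap Z^2$. Consider the RQA transformation $R : J \to L$ such that

$$R(j) = \lfloor Tj + t \rfloor \qquad (6)$$

where

$$T = \frac{1}{a-1} \begin{bmatrix} a & 1 \\ -1 & a-2 \end{bmatrix}, \quad a \in I,\ a > 2 \qquad (7)$$

and

$$t = \frac{1}{a-1} \begin{bmatrix} 0 \\ a-2 \end{bmatrix}$$

and

$$L = L^r \cap Z^2, \quad L^r = \{ \ell^r : \ell^r = T j^r + t,\ j^r \in J^r \}.$$

The transformation R is a bijection.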

Proof 

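We show that R is both an injection and a surjection and thus it must also be a bijection.

(a) R is an injection - By contradiction. Assume that $j, j' \in J$, $j \ne j'$ and $\ell = R(j) = R(j') = \ell'$.
We show that this implies $j = j'$, i.e., $\delta = j - j' = 0$. $R(j)$ can be reexpressed as

$$R(j) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} j + \left\lfloor \frac{1}{a-1} \begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix} j + \frac{1}{a-1} \begin{bmatrix} 0 \\ a-2 \end{bmatrix} \right\rfloor$$

and, because $\lfloor -(x+1)/k \rfloor = -[1 + \lfloor x/k \rfloor]$ for all integers x and k > 0, we have

$$\ell_1 = j_1 + \lfloor (j_1 + j_2)/(a-1) \rfloor \quad \text{and} \quad \ell_2 = j_2 - \lfloor (j_1 + j_2)/(a-1) \rfloor \qquad (8)$$

The assumption $\ell = \ell'$ implies

$$\delta_1 + \left\lfloor \frac{j_1' + j_2' + \delta_1 + \delta_2}{a-1} \right\rfloor - \left\lfloor \frac{j_1' + j_2'}{a-1} \right\rfloor = 0$$

$$\delta_2 - \left\lfloor \frac{j_1' + j_2' + \delta_1 + \delta_2}{a-1} \right\rfloor + \left\lfloor \frac{j_1' + j_2'}{a-1} \right\rfloor = 0$$

Adding the two equations gives $\delta_1 + \delta_2 = 0$. Substituting for $\delta_1 + \delta_2$ in the floor functions of the
above equations implies $\delta_1 = \delta_2 = 0$, i.e., $\delta = 0$.

(b) R is a surjection - Since $|T| = 1$, $L^r$ has the same area as $J^r$. Thus, $L$ cannot contain
more integer points than $J$ does. Since R is an injection it must also be a surjection (pigeonhole
principle). Q.E.D.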
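The closed form (8) and the injectivity claim can be checked numerically. The sketch below assumes that the transformation reads $T = \frac{1}{a-1}[[a, 1], [-1, a-2]]$ and $t = \frac{1}{a-1}(0, a-2)^T$, which is the reading consistent with (8) and (9); the function names `rqa_floor`, `rqa_closed` and `rqa_inverse` are hypothetical, and `rqa_inverse` is an inverse derived here from (8), not one given in the text.

```python
import math
from fractions import Fraction


def rqa_floor(j1, j2, a):
    """R(j) = floor(T j + t), evaluated with exact rational arithmetic."""
    k = a - 1
    l1 = math.floor(Fraction(a * j1 + j2, k))
    l2 = math.floor(Fraction(-j1 + (a - 2) * j2 + (a - 2), k))
    return l1, l2


def rqa_closed(j1, j2, a):
    """Closed form (8): both entries share one floor term."""
    q = (j1 + j2) // (a - 1)
    return j1 + q, j2 - q


def rqa_inverse(l1, l2, a):
    """Since l1 + l2 = j1 + j2, the floor term can be recomputed from l."""
    q = (l1 + l2) // (a - 1)
    return l1 - q, l2 + q


points = [(j1, j2) for j1 in range(-10, 11) for j2 in range(-10, 11)]
forms_agree = all(rqa_floor(j1, j2, a) == rqa_closed(j1, j2, a)
                  for a in range(3, 8) for j1, j2 in points)
reversible = all(rqa_inverse(*rqa_closed(j1, j2, a), a) == (j1, j2)
                 for a in range(3, 8) for j1, j2 in points)
print(forms_agree, reversible)
```

The existence of an integer inverse is what "reversible in the integers" means here: the original index is recovered from the new one without any search.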
Next we show that $[[TD]]$ satisfies (1) and (2). In fact we have

$$TD = \frac{1}{a-1} \begin{bmatrix} a & 1 \\ -1 & a-2 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & -1 \end{bmatrix} = \frac{1}{a-1} \begin{bmatrix} a & a+1 & a-1 \\ -1 & a-3 & -(a-1) \end{bmatrix} \qquad (9)$$

The worst case occurs when we take the floor of the first row entries and the ceiling of the
absolute values in the second row. This corresponds to the case when the least time is available for
data communications to take place. The resulting matrix is

$$\begin{bmatrix} 1 & 1 & 1 \\ -1 & 1 & -1 \end{bmatrix} \qquad (10)$$

which clearly satisfies (1) and (2). Another possible matrix resulting from $[[TD]]$ corresponds to the
case when the ceiling and floor functions are applied to the first and second rows, respectively. The
resulting matrix is

$$\begin{bmatrix} 2 & 2 & 1 \\ 0 & 0 & -1 \end{bmatrix} \qquad (11)$$

which, as expected, also satisfies (1) and (2). It also becomes obvious that, at different execution
times, the same variables could move every unit of time, or move every two units of time or stay in
the same processor for two units of time. This may suggest that data timing and movement are hard
to predict. Fortunately this is not the case and much information can be derived from our formalism.
For example, it can be shown that for any dependence d, the only values of $[[Td]]$ which occur
in the new algorithm correspond to those where the ceiling function is applied to one entry of $Td$ and
the floor function is applied to the other entry (a consequence of $\ell_1 + \ell_2 = j_1 + j_2$ from (8)). In other
words, the new dependencies correspond to vectors present in the matrices

$$\begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & -1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 2 & 2 & 1 \\ -1 & 0 & -1 \end{bmatrix}$$

An important implication is the fact that buffering (local memory) is required (for the first two
columns of the second matrix). As another example, consider the first 2 entries of the second row of
(9) and consider the question of finding out when, for a given processor, the corresponding variable
moves or remains in the processor (i.e., when the ceiling and floor values are valid). It is possible to
show that the ceiling function is valid (a-2) out of every (a-1) times for the first entry and (a-3) out
of every (a-1) times for the second entry (in a periodic manner). Similar deductions can be done with
respect to timing and data movement for other individual processors and variables.
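These statements about the transformed dependences can be observed directly by applying the closed form (8) to neighboring index points. The sketch below is illustrative only: the names `rqa`, `new_dependences` and `ceiling_counts` are hypothetical, the search window of 40 x 40 index points is arbitrary, and the periodic counts are taken over one period of $j_1 + j_2$.

```python
def rqa(j1, j2, a):
    """Closed form (8) of the transformation R."""
    q = (j1 + j2) // (a - 1)
    return j1 + q, j2 - q


def new_dependences(a, size=40):
    """For each original dependence (columns of the worst-case D), collect
    the differences R(j + d) - R(j) seen over a window of index points."""
    found = {(1, 0): set(), (1, 1): set(), (1, -1): set()}
    for j1 in range(size):
        for j2 in range(size):
            l1, l2 = rqa(j1, j2, a)
            for d in found:
                m1, m2 = rqa(j1 + d[0], j2 + d[1], a)
                found[d].add((m1 - l1, m2 - l2))
    return found


def ceiling_counts(a):
    """Out of a - 1 consecutive values of j1 + j2, count how often the
    ceiling value applies to the first two entries of the second row of (9)."""
    k = a - 1
    first = sum(1 for x in range(k) if (x + 1) // k == x // k)
    second = sum(1 for x in range(k) if (x + 2) // k == x // k)
    return first, second


print(new_dependences(5))
print(ceiling_counts(5))
```

For a = 5 the only vectors found are the columns of the two matrices listed above, and the counts are 3 and 2 out of every 4, i.e., (a-2) and (a-3) out of every (a-1).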

As for linear transformations, graphical representations of RQA mappings are quite insightful.
For linear transformations, the locus of $\pi j$ = constant in the index set of the original algorithm
corresponds to a plane or line which describes a computational wavefront (i.e., the execution of computations
whose indices belong to the wavefront takes place at the same time). Likewise, the locus
of $Sj = p_0$ contains the indices of computations executed by processor $p_0$. For the case of RQA mappings,
the locus of $\lfloor \pi j + t_1 \rfloor$ = constant corresponds to several consecutive wavefronts which are
computed simultaneously. The locus of $\lfloor Sj + t_2 \rfloor = p_0$ contains all indices of computations executed
by processor $p_0$.

The transformation R given by (6) is the basis for three reconfiguration schemes to be described
later in this paper. In general, when reconfiguring an algorithm for execution on a linear array with
N processors, one of which is faulty, we assign the value N to the parameter a in R. In order to
illustrate the concepts introduced above, we now discuss an example for which N = a = 5. Thus, we
have, from (6),

$$R(j) = \left\lfloor \frac{1}{4} \begin{bmatrix} 5 & 1 \\ -1 & 3 \end{bmatrix} j + \frac{1}{4} \begin{bmatrix} 0 \\ 3 \end{bmatrix} \right\rfloor \quad \text{and} \quad TD = \frac{1}{4} \begin{bmatrix} 5 & 6 & 4 \\ -1 & 2 & -4 \end{bmatrix}$$

Figure 3 illustrates several ideas discussed before. Figure 3(a) shows the original worst case
algorithm before any fault occurred. The dots represent computation indices and the arrows depict




Fault  Tolerance  and  Test 




data movement. We assume an execution time of a-1 = 4 units of time (the reason will become clear
later). Figure 3(b) depicts the computational wavefronts (full lines) and the time when they are
executed as determined by R (each wavefront is labeled with its execution time in the new algorithm).
The broken lines contain indices of computations executed by the same processor (and are labeled with
the processor indices). Figure 3(c) shows the same information as figure 3(b) in different form together with
the movement of data. The crosshatched "bars" contain indices of computations executed at the
same time. The dotted bars contain indices of computations executed by the same processor. The
following properties of the reconfigured algorithm discussed previously are now readily apparent.
First, notice that only N-1 = 4 processors are used, i.e., the faulty processor is not required. Second,
no arrow crosses a dotted bar, i.e., all communication occurs between operational neighboring processors
(we will comment on necessary reconfiguration hardware later). Third, some data must be
buffered in each processor for two units of time. This occurs whenever an arrow crosses a
crosshatched bar. As predicted, stationary data in the original algorithm remains in the same processor
three out of four steps in the new algorithm (i.e., (a-2)/(a-1) = 3/4), data shifting to the right
stays in the same processor two out of four steps (i.e., (a-3)/(a-1) = 2/4) and data shifting left moves
in every step. Finally, from figure 3(b), it is clear that each time and space wavefront contains a
single index, i.e., each computation belongs to a different wavefront. Since the smallest entry in the
first row of the matrix given by (9) is a-1 = 4, we know that at most that many computations can
occur simultaneously [Fortes & Parisi 84]. Hence, only that many processors are needed.

Note that this reconfiguration scheme is optimal since no operational processor is ever idle and
the execution time is increased the least possible, i.e., by a factor given by the ratio of the number of
all processors over the number of operational processors (5/4 in this case). It is interesting to note
that a necessary condition for a transformation to preserve the product of the number of processors
by execution time is that it be reversible in the integers. Theorem 1 proved that R satisfies this
condition. With respect to throughput, assume that the original algorithm accepted a new input and
generated a new output every unit of time. The new algorithm accepts 4 new inputs and generates 4
new outputs every 5 units of time, which is also optimal.
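
The properties read off figure 3 can be checked numerically. The sketch below is an illustration only, not part of the original paper. It assumes a closed form for R, namely $\hat{\tau} = j_1 + \lfloor (j_1+j_2)/(a-1) \rfloor$ and $\hat{p} = j_2 - \lfloor (j_1+j_2)/(a-1) \rfloor$, and it assumes that $j_1$ indexes time and $j_2$ indexes processors in the original array. All function and variable names are hypothetical.

```python
def R(j1, j2, a):
    """Assumed closed form of the reconfiguration transformation R."""
    q = (j1 + j2) // (a - 1)      # floor of (j1 + j2) / (a - 1)
    return j1 + q, j2 - q         # (new execution time, new processor index)


a = 5                             # value used in the example of figure 3

# Original algorithm: a-1 = 4 time steps (j1) on N = 5 processors (j2).
index_set = [(j1, j2) for j1 in range(a - 1) for j2 in range(a)]
images = [R(j1, j2, a) for j1, j2 in index_set]

times = sorted({t for t, _ in images})   # time steps used after reconfiguration
procs = sorted({p for _, p in images})   # processors used after reconfiguration
all_distinct = len(set(images)) == len(index_set)

print(times, procs, all_distinct)
```

Under these assumptions the 20 computations occupy 5 time steps on 4 processors, so the product of processors by execution time of the original algorithm (5 processors, 4 steps) is preserved.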

Now let us consider the general case when the original array algorithm has execution time
larger than a-1 = 4 units of time. The algorithm which results from applying R has the same
characteristics as before including the fact that at most 4 processors are used at any time. However,
the indices of these processors are not restricted to range from 0 to 3 and more than 4 processors
would be required. To solve this problem, we consider three possible schemes based on the
transformation R. We describe them next and discuss their characteristics afterwards. Let $(\tau, p)$ denote the
image in the reconfigured algorithm of the index $j = (j_1, j_2)$ in the original array algorithm. The three
possible schemes are as follows:


Scheme 1 - $(\tau, p) = (\hat{\tau},\ \hat{p} \bmod (a-1))$ where $(\hat{\tau}, \hat{p})^T = \hat{j} = R(j)$


Scheme 2 - $(\tau, p) = (\hat{\tau} + a \lfloor j_1/(a-1) \rfloor,\ \hat{p})$ where $(\hat{\tau}, \hat{p})^T = \hat{j} = R(j_1 \bmod (a-1),\ j_2)$

Scheme 3 - $(\tau, p) = (\hat{\tau} + a \lfloor j_1/(a-1) \rfloor,\ \hat{p})$ where

$$(\hat{\tau}, \hat{p})^T = \hat{j} = \begin{cases} R(j_1 \bmod (a-1),\ j_2) & \text{if } \lfloor j_1/(a-1) \rfloor \bmod 2 = 0 \\ \bar{R}(j_1 \bmod (a-1),\ j_2) & \text{otherwise} \end{cases}$$

where $\bar{R}$ is such that

$$\bar{R}(j) = \left\lfloor \frac{1}{a-1} \begin{bmatrix} a & -1 \\ 1 & a-2 \end{bmatrix} j + \frac{1}{a-1} \begin{bmatrix} a-1 \\ 0 \end{bmatrix} \right\rfloor$$

Though it may be possible to analyze the three schemes mathematically, it is easier to explain
them by referring to figure 3. For execution times of the original algorithm larger than a-1 (a = 5 for
the example), scheme 1 essentially replicates figure 3 every additional a-1 units of time. However, for
every replica, in addition to changes to the time indices, the processor indices are increased by
$\lfloor \tau/(a-1) \rfloor \bmod (a-1)$. For example, computations with indices (a-1,0) through (a-1,a-2) are executed at
time $\tau = 5$ by processors with indices 3, 0, 1 and 2 in this order. Computation at index (a-1,a-1) is
executed at $\tau = 6$ by processor 2. It is easy to see that this scheme will require a wrap-around
link between the first and last processor of the array.

Scheme 2 also generates a replica of figure 3 every a-1 units of time. In this case the replica is
exact for the processor indices. However, at the interface between consecutive replicas the
movement of data associated with the original dependence $(1\ 1)^T$ does not occur between adjacent
operational processors. If such dependence is not present (e.g., in an algorithm which is not a worst case
computation) then this scheme is acceptable. For example, computations with indices (a-1,0) through
(a-1,a-2) are executed at time $\tau = 5$ by processors with indices 0 through 3 and computation with
index (a-1,a-1) is executed at time $\tau = 6$ by processor 3.

Scheme 3 also generates a replica of figure 3 every a-1 units of time. However, successive
replicas are mirror images of each other (except, of course, for the arrows depicting data movement).
This scheme does not have any of the disadvantages of previous approaches. Note that $\bar{R}$ is
obtained from R by changing the sign of the off-diagonal entries of T and changing the value of t
from $(0,\ (a-2)/(a-1))^T$ to $((a-1)/(a-1),\ 0)^T = (1\ 0)^T$. Graphically, the net result of these changes is
the reversal of the sign of the slope of the wavefronts shown in figure 3(b). This, in turn, yields the
mirror image of the figure 3(c) generated by R. For example, computations with indices (a-1,1)
through (a-1,a-1) are executed at time $\tau = 5$ by processors 0 through 3, and computation with index
(a-1,0) is executed by processor 0 at time $\tau = 6$.
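
The three schemes can be sketched in a few lines of code. The block below is a hedged illustration rather than the authors' implementation: the closed forms used for R and for the mirrored transformation are assumptions chosen to agree with the worked examples in the text (a = 5), and all function names are hypothetical.

```python
def R(j1, j2, a):
    """Assumed closed form of R: (time, processor) image of index (j1, j2)."""
    q = (j1 + j2) // (a - 1)
    return j1 + q, j2 - q


def R_bar(j1, j2, a):
    """Assumed closed form of the mirrored transformation used by scheme 3."""
    q = (j1 - j2) // (a - 1)      # Python's // floors toward minus infinity
    return j1 + 1 + q, j2 + q


def scheme1(j1, j2, a):
    """Apply R directly and wrap processor indices around modulo a-1."""
    tau, p = R(j1, j2, a)
    return tau, p % (a - 1)


def scheme2(j1, j2, a):
    """Repeat the basic pattern exactly for every block of a-1 original steps."""
    tau, p = R(j1 % (a - 1), j2, a)
    return tau + a * (j1 // (a - 1)), p


def scheme3(j1, j2, a):
    """Alternate the basic pattern and its mirror image on successive blocks."""
    block = j1 // (a - 1)
    base = R if block % 2 == 0 else R_bar
    tau, p = base(j1 % (a - 1), j2, a)
    return tau + a * block, p


a = 5
row = [(a - 1, j2) for j2 in range(a)]          # indices (a-1, 0) ... (a-1, a-1)
print([scheme1(j1, j2, a) for j1, j2 in row])
print([scheme2(j1, j2, a) for j1, j2 in row])
print([scheme3(j1, j2, a) for j1, j2 in row])
```

For the row of indices (a-1, 0) through (a-1, a-1), the three printed lists can be compared directly with the examples given in the text for each scheme.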

It remains to discuss the case when the number of faulty processors may be larger than
one. The solution is simple and consists of recursively applying the proposed scheme(s) once for
each faulty processor. Clearly, N-1 faults can be tolerated with minimal performance degradation in a
linear processor array with N processing elements.


HARDWARE  REQUIREMENTS 

Simple additional hardware is required to support the algorithm reconfiguration schemes
discussed here. It must be possible to bypass each faulty processor. Thus, switching hardware is
minimal. Also, additional local memory is required for each processor in an amount proportional to
the number of faults to be tolerated. The constant factor is rather small and the reader can easily
verify that for the systolic array of example * this constant is t. In fact, if we reverse the sign of the
coordinate $j_1$ of the index set for that example, this constant is - for the resulting algorithm.
Finally, one must also consider the implications of implementing on-line the reconfiguration schemes
on the complexity of the control and host interface hardware. It must be possible to implement R in
real-time. As mentioned in the proof of theorem 1, $(\hat{\tau}, \hat{p})^T = \hat{j} = R(j)$ implies that

$$\hat{\tau} = j_1 + \lfloor (j_1 + j_2)/(a-1) \rfloor \quad \text{and} \quad \hat{p} = j_2 - \lfloor (j_1 + j_2)/(a-1) \rfloor$$

Thus R(j) can be computed with at most two adders, one divider and one subtracter. Note that the
floor functions and the modulo operations which are also required for each scheme are easily done by
discarding or masking bits of a number. In addition to the computation of R, scheme 3 also requires
the computation of $\bar{R}$. It is relatively easy to show that $(\hat{\tau}, \hat{p})^T = \hat{j} = \bar{R}(j)$ implies that

$$\hat{\tau} = j_1 + 1 + \lfloor (j_1 - j_2)/(a-1) \rfloor \quad \text{and} \quad \hat{p} = j_2 + \lfloor (j_1 - j_2)/(a-1) \rfloor$$

Thus, the same hardware can be used to compute R and $\bar{R}$. Hence
hardware requirements are rather small.
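
The remark about discarding bits can be made concrete when a-1 is a power of two, as in the running example (a = 5, a-1 = 4). The sketch below is an illustration under stated assumptions, not the authors' circuit: it assumes that R has the form $\lfloor Tj + t \rfloor$ with $T = \frac{1}{a-1}\begin{bmatrix} a & 1 \\ -1 & a-2 \end{bmatrix}$ and $t = (0,\ (a-2)/(a-1))^T$, and compares that form with an adder, shift and subtracter version. The function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def R_floor_form(j1, j2, a):
    """R evaluated as floor(T j + t) with exact rational arithmetic (assumed T, t)."""
    d = a - 1
    tau = floor(Fraction(a * j1 + j2, d))
    p = floor(Fraction(-j1 + (a - 2) * j2 + (a - 2), d))
    return tau, p


def R_shift_form(j1, j2, log2_d):
    """Same mapping using one addition, one shift, one addition, one subtraction.

    Valid only when a-1 == 2**log2_d, so that the division reduces to
    discarding the low-order bits of j1 + j2.
    """
    q = (j1 + j2) >> log2_d
    return j1 + q, j2 - q


a, log2_d = 5, 2                   # a - 1 = 4 = 2**2
agree = all(
    R_floor_form(j1, j2, a) == R_shift_form(j1, j2, log2_d)
    for j1 in range(16)
    for j2 in range(a)
)
print(agree)
```

When a-1 is not a power of two the shift must be replaced by a true integer divider, which is the case covered by the hardware count given above.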


TWO-DIMENSIONAL  ARRAYS 

Since our formalism and basic ideas are applicable to 2-dimensional arrays, RQA
transformations can also be used to devise reconfiguration schemes for these arrays. Several types of RQA
transformations are useful depending on the degree of hardware reconfigurability assumed. In
general, optimal graceful degradation is harder to achieve than for linear arrays, unless relatively
complex reconfiguration hardware is used. This seems to be inherent to the nature of the interconnection
structure of 2-dimensional arrays. When considering very simple forms of hardware reconfiguration,
it may be necessary to logically remove operational as well as faulty processors. We have studied
several schemes with advantages over previous approaches which require hardware mechanisms of
comparable complexity. Due to space limitations, a discussion of these schemes and their relative
merits is not done here and will appear in a forthcoming paper.


CONCLUSIONS 

This paper described three related algorithm reconfiguration schemes which, together with
simple reconfiguration hardware, can be used to achieve optimal graceful degradation in linear processor
arrays. These schemes are based on a class of RQA transformations, a new type of algorithm
transformations introduced in this paper. While our results have general applicability, their practical
advantages will undoubtedly depend also on the nature of the implementation and intended
application of the processor array. The general approach described here can also be applied to
2-dimensional processor arrays and is also useful for mapping arbitrarily large algorithms into arrays of
fixed size.
This paper identifies equivalences between two systematic methodologies for the
design of systolic arrays and illustrates the benefits of understanding these
relationships. The methods are the parameter method of Li and Wah and the
dependency method of Moldovan and Fortes. After a review of the core ideas,
models and parameters of each method, mathematical relations between them are
derived. The usefulness of these relations is illustrated by showing how (1)
optimization procedures for the parameter method suggest similar procedures for
the dependency method, (2) systolic designs for convolution and deconvolution,
obtained through different methods, can be mathematically proven to be identical
and (3) new systolic equations for the parameter method result from the
knowledge of equivalent equations in the dependency method.

I  -  Introduction 

In this paper, two proposed methodologies for the systematic design of systolic arrays are
comparatively studied. They are the data dependency method of Moldovan and Fortes,
and the parameter method of Li and Wah, [9],[10]. We identify and expose the recondite
relationships and equivalences between the two methodologies and use this information to
improve them and verify similar designs.

Section II provides a short description of both methodologies. Section III establishes
equivalences between the mathematical expressions used to systematically design systolic arrays
in the two methods. The equivalences of section III are used in section IV to propose
optimization procedures and improvements for both methodologies. Additionally, the two
methods are used to obtain systolic arrays for the convolution algorithm and the resulting
designs are mathematically proven to be equivalent. Section V is dedicated to conclusions.

II - Introduction to the Parameter and Data Dependency Methods

2.1 Parameter Method [9],[10]

This methodology considers the design of optimal pure planar systolic arrays for a class of
linear recurrences which take the general form

$$z_{i,j}^{k} = f\left[ z_{i,j}^{k-1},\ x_{x(i,k)},\ y_{y(k,j)} \right] \quad (2.1.1)$$

where f is the function to be executed by each cell of the array and x(i,k), y(k,j) are linear
indexing functions for the two-dimensional input variables X and Y. In the following
presentation, the coefficients of i, j, k are either 1 or -1. One-dimensional recurrences have the
general form

i,*'  ^  /[zF*  .  S,,II  . -Van].  i-i\  (.M  -M 

More general recurrences can also be considered. In the interest of brevity we do not
consider them here. However, all results of this paper can be extended to include those cases,
as reported in [12].
Three sets of parameters are used to characterize a systolic array: velocities of data flow,
data distributions, and periods of computation. The velocity of a datum x is the directional
distance passed by that datum in one clock cycle and is denoted by $x_d$. The distance between
two PEs is defined to be one. Thus, $x_d$ must be less than or equal to one because broadcasting
is not allowed in pure systolic arrays.

Data distributions are defined using row and column displacements. For two-dimensional
input and output matrices, the elements along a row or column are arranged in a straight line
and the distance between adjacent elements in a row or column remains constant as the data
flows through the array. To define the row displacement of array X, suppose that the row and
column indices of X are i and j, respectively. The row displacement of X is the directional
distance between $x_{i,j}$ and $x_{i+1,j}$ and is written as $x_{is}$. Similarly, the column displacement
is the distance between $x_{i,j}$ and $x_{i,j+1}$ and is written as $x_{js}$.

Periods of computation are described using two functions, $\tau_c$ and $\tau_a$. $\tau_c$ is defined as the
time at which a computation is performed, whereas $\tau_a$ defines the time at which a variable is
accessed. The periods of i and j for two-dimensional outputs are defined as
positive. If this is not true for a given recurrence, the recurrence can be rewritten to satisfy
this condition. In computing $z_{i,j}^{k}$, $x_{i,k}$ and $y_{k,j}$ are accessed and two additional periods
can be included to describe this interaction. They are

ty*  -  '■J’tjti.k -i|)  ”  (21  0| 

Depending on the order of access, $t_{kx}$ and $t_{ky}$ may be negative. Since operands to be used
in a computation must arrive at a PE simultaneously, the magnitude of the periods must equal
$t_k$, i.e., it must be true that

$$t_k = |t_{kx}| = |t_{ky}| \quad (2.1.8)$$

The periods are independent of the indices i, j, and k and they must be greater than or
equal to one to prevent broadcasting.

These parameters (velocity, data distribution, and periods) can be combined into a set of
equations which describe the operations of a systolic array. These equations, for the
two-dimensional case, are
[Equations (2.1.9) - (2.1.14)]

The parameters and equations described previously can be used to formulate the design
process as an optimization problem. In the following equations, the expression |p| represents
the magnitude of the vector quantity p. The design problem is formulated as follows:


minimize #PE · T² or #PE · T or T    (2.1.15)

subject to (2.1.9 - 2.1.14)
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and  the  recurrence  determines  the  r< 
Recall that $t_{kx}$ describes the number of clock cycles which elapse between two consecutive
computations using variable x, and that $x_d$ represents the directional distance traversed by
datum x in one clock cycle. Thus, $k_1$ in (2.1.20) represents the magnitude of the directional
distance traversed by a datum x between its use in two consecutive computations. Similarly
$k_2$ and $k_3$ represent this same distance for variables y and z, respectively. $k_1$, $k_2$ and $k_3$
describe the spatial distance covered by a datum between its use in two consecutive
computations. The $t_{i\max}$, $t_{j\max}$ and $t_{k\max}$ values are the maximum period values considered
in the optimization. Periods equal to or greater than these maximum values result in
completion times which are equal to or greater than the serial processing time. Constraint
equations (2.1.21) and (2.1.22) prohibit multiple inputs from entering a PE in one cycle.
Clearly, if a data distribution vector is equal to zero, two or more data elements are separated
by zero distance and must enter a PE simultaneously. Thus, the formulation of the design
problem as an optimization problem as given in equations (2.1.9-2.1.14) and (2.1.15-2.1.22)
ensures that the resulting array satisfies the constraints of systolic processing.

The optimal systolic array for a given recurrence can be found by systematically
enumerating the possible solutions using a search order that guarantees that the first feasible
solution found is, in fact, the optimal one. Consider optimizing T, the total time needed to
complete the computation. First set $k_1 = k_2 = k_3 = 1$ or, if a particular variable p is to
remain in the same PE, set the associated $k_p = 0$. Then, set the magnitudes of the periods
$t_i$, $t_j$, and $t_k$ equal to one and determine if a feasible solution exists. If a feasible solution is
found it is the optimal solution for T because T is a linear function of the periods that
increases monotonically with increases in the magnitude of these periods. If no feasible solution
is found with $t_i = t_j = t_k = 1$, one of the periods is increased by one and the search for a
feasible solution repeats. If no feasible solution can be found with $k_1 = k_2 = k_3 = 1$, one of
the $k_i$ is increased by one and the search begins again. A flowchart describing this
optimality procedure is shown in [12].
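
The search order described above can be sketched as a lowest-cost-first enumeration over the period magnitudes. The block below is only an illustration under assumptions: the cost function and the feasibility test passed in are hypothetical stand-ins (the real feasibility test would check the systolic processing constraints), and the function name `first_feasible` is not from the original paper. Because the cost is assumed to be non-decreasing in every period, the first feasible point reached has minimum cost.

```python
import heapq
from itertools import count


def first_feasible(cost, feasible, start=(1, 1, 1), limit=(8, 8, 8)):
    """Enumerate period triples in order of non-decreasing cost.

    cost     -- hypothetical function, non-decreasing in each period
    feasible -- hypothetical predicate standing in for the constraint check
    Returns (periods, cost) for the first feasible triple, or None.
    """
    tie = count()                                   # tie-breaker for equal costs
    heap = [(cost(start), next(tie), start)]
    seen = {start}
    while heap:
        c, _, periods = heapq.heappop(heap)
        if feasible(periods):
            return periods, c
        for axis in range(len(periods)):            # increase one period by one
            nxt = list(periods)
            nxt[axis] += 1
            nxt = tuple(nxt)
            if nxt not in seen and all(v <= m for v, m in zip(nxt, limit)):
                seen.add(nxt)
                heapq.heappush(heap, (cost(nxt), next(tie), nxt))
    return None


# Toy example with made-up cost and constraints (not from the paper).
toy_cost = lambda t: 3 * t[0] + 2 * t[1] + t[2]
toy_feasible = lambda t: t[0] + t[1] >= 4 and t[2] >= 2
print(first_feasible(toy_cost, toy_feasible))
```

The `limit` argument plays the role of the maximum period values beyond which completion time is no better than serial processing.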

Consider optimizing AT². First, use the above procedure to find a solution with optimal
execution time $T_1$, which uses $P_1$ processing elements. Then, let the largest dimension of the
input or output matrix be (n × n) and assume that the smallest number of PEs that can be
used in a solution is n. Then, solutions with completion time $T_2$ such that $n T_2^2 > P_1 T_1^2$,
i.e.,

$$T_2 > T_1 \sqrt{P_1 / n}$$

need not be considered and the search is carried out only for values smaller than $T_2$. The
reason for ignoring designs which have completion times greater than $T_2$ is that, if $T_2^2$
multiplied by the minimum possible number of PEs is greater than the AT² measure for the
$T_1$, then the solution cannot have a smaller AT² measure. When all possible designs with
execution time between $T_1$ and $T_2$ have been found, the AT² measure is compared to find the
minimum solution.

M. T. O'Keefe and J. A. B. Fortes

This optimization procedure has been applied to find optimal systolic arrays for matrix
multiplication, FIR filtering, discrete Fourier transform and other algorithms [9].

2.2 Data Dependency Method [1]-[8]

Let $Z^n$ denote the nth cartesian power of Z, the set of nonnegative integers. To describe
an algorithm A, a five-tuple $A = (J^n, C, D, X, Y)$ is used where $J^n \subset Z^n$ is the index set,
C is the set of computations, D is the set of dependence vectors, X is the set of input
variables, and Y is the set of output variables. The data dependencies describe the structure of
the algorithm and are given as a set of triples (d, v, j) such that the computation indexed by j
requires the variable v, generated at index j - d, as an operand.

As an example, consider the two-dimensional recurrence
$a(j_1, j_2) = f[\,a(j_1-1,\ j_2+1),\ a(j_1-1,\ j_2-1)\,]$, $0 \le j_1 \le 4$, $0 \le j_2 \le 4$, where f is some function.
We can describe it as $A = (J^2, C, D, X, Y)$ where $J^2 = \{\, j : 0 \le j_i \le 4,\ i = 1,2 \,\}$, C is
the set of all computations on the right-hand side of the recurrence equation, i.e.,
$C = \{\, f(a(j_1-1,\ j_2+1),\ a(j_1-1,\ j_2-1)) : (j_1, j_2)^T \in J^2 \,\}$, and D, a set of triples
(d, v, j), can be described by a matrix whose columns correspond to the first element of each
triple and v, j need not have an explicit representation. Thus, the columns of D correspond to
the vector difference between $(j_1, j_2)^T$ and the indices of the references to a on the right-hand
side of the recurrence, $(j_1-1,\ j_2+1)^T$ and $(j_1-1,\ j_2-1)^T$, which yields

$$D = \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$$
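
The construction of D from the recurrence can be written out directly. This is a small illustrative sketch, not code from the paper: each right-hand-side reference is described by its offset from the index j (for instance $a(j_1-1, j_2+1)$ has offset (-1, +1)), and the function name is hypothetical.

```python
def dependence_matrix(rhs_offsets):
    """Build D whose columns are d = j - (j + offset) = -offset.

    rhs_offsets -- one offset tuple per reference on the right-hand side.
    Returns D as a list of rows.
    """
    columns = [tuple(-o for o in offset) for offset in rhs_offsets]
    return [list(row) for row in zip(*columns)]     # transpose columns into rows


# References a(j1-1, j2+1) and a(j1-1, j2-1) of the example recurrence.
D = dependence_matrix([(-1, +1), (-1, -1)])
print(D)
```

Each column d satisfies the defining property of a dependence: the operand used at index j was generated at index j - d.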


X is the set of input variables and Y is the set of output variables, i.e.,
$X = \{\, a(-1, l_2) : -1 \le l_2 \le 5 \,\} \cup \{\, a(l_1, l_2) : 0 \le l_1 \le 3,\ l_2 = -1, 5 \,\}$, $Y = \{\, a(j_1, j_2) : 0 \le j_1, j_2 \le 4 \,\}$

Linear indexing functions [1] describe how variables are referenced. A linear indexing
function $F : J^n \to Z^m$ is defined by an equation of the form $F(j) = Cj + C_0$ where
$C_0 \in Z^{m \times 1}$ is called the index displacement and $C \in Z^{m \times n}$ is called the indexing matrix. For
example, the variable $a(j_1 - j_2,\ j_2 + j_3 - 1,\ j_3 - j_1)$ has a linear indexing function for which

$$C = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & 1 \\ -1 & 0 & 1 \end{bmatrix} \quad \text{and} \quad C_0 = \begin{bmatrix} 0 \\ -1 \\ 0 \end{bmatrix}$$


Data broadcasting is not allowed in systolic arrays and the data dependency method can
detect, remove or reduce broadcasts from algorithms to be implemented on systolic arrays [1].
During the execution of an algorithm, a variable needs to be broadcasted if and only if both of
the following conditions are satisfied: (1) at least two computations use the variable and (2)
such computations are scheduled for execution at the same instant of time. To determine if
the first condition is satisfied, it is clear that a variable with indexing function F is used by
two computations with different indices if and only if
F(  r  1  =  F(  p  1  ,  i.e..  Firi  =  (1 
' .  From  the  definition  of  F  equ.  (2.2.1)  can  be  rewritten  as 


(2.2  1) 


where  F  = 


cr 


0 


(2.2.2) 


In essence, the dependency method finds a reindexing transformation which, when applied to
the original algorithm, yields a new T-equivalent algorithm which maps easily into a systolic
array. A transformation matrix T can be used to describe a linear bijection which transforms
the dependency matrix and index set of an algorithm so that it can be executed in a VLSI
array. T can be partitioned into two matrices, $\pi$ and S:

$$T = \begin{bmatrix} \pi \\ S \end{bmatrix}$$

Two systematic design methodologies for systolic arrays

The $\pi$ matrix defines the time transformation whereas S defines the space transformation to be
applied to the dependence matrix and index set of an algorithm. The time at which a
computation indexed by j is executed is determined by $\pi j$, while $Sj$ specifies which processor
is to execute this computation. In other words, the transformed equivalent algorithm is such
that the first coordinate of the index of any computation determines its execution time and the
remaining coordinates determine which processor is to be used. Of course $\pi$ and S must
satisfy certain conditions if they are to be considered valid transformations. Let m be the
number of columns in the dependency matrix. Time transformations must satisfy
$\pi d_i > 0$, $i = 1, \ldots, m$, where $d_i$ is a column vector in the dependence matrix. This constraint
results from the requirement that a variable must be generated before it is used in a
computation. The time of execution of a computation with index j is given by

j )  =  I  le  ■i'‘i  I 

I  disp»^ 


where disp $\pi$, the displacement of the ordering determined by $\pi$, must satisfy
disp $\pi \le \min\{\pi d_i : i = 1, \ldots, m\}$. Intuitively, the displacement describes the number of parallel
wavefronts that simultaneously sweep over the index set to complete the computation. In this
paper, unless otherwise stated, the displacement is considered to be one, since the parameter
method considers only this case.
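
The validity condition on the time transformation is easy to test mechanically. The sketch below is illustrative only: it assumes D is stored as a list of rows, uses the example dependence matrix with columns $(1, -1)^T$ and $(1, 1)^T$ from the recurrence above, and takes the displacement bound as "at most the smallest $\pi d_i$" as read from the text. Function names are hypothetical.

```python
def time_products(pi, D):
    """Return pi . d_i for every column d_i of the dependence matrix D."""
    return [sum(p * d for p, d in zip(pi, column)) for column in zip(*D)]


def is_valid_time_transformation(pi, D):
    """A variable must be generated before it is used: pi . d_i > 0 for all i."""
    return all(value > 0 for value in time_products(pi, D))


def displacement_bound(pi, D):
    """Largest displacement allowed under the assumed bound min(pi . d_i)."""
    return min(time_products(pi, D))


D = [[1, 1],
     [-1, 1]]                      # example dependence matrix (rows of D)

print(is_valid_time_transformation((1, 0), D))   # both products equal 1
print(is_valid_time_transformation((0, 1), D))   # first product is -1
print(displacement_bound((2, 0), D))
```

With displacement fixed to one, as in the rest of the paper, only the sign test is needed.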

The space transformation S maps the computation indexed by j into processor $Sj$. This
assumes a processor array model consisting of a grid which has the dimensionality of the array.
Each point of the grid corresponds to a processor and the coordinates of the point are the
index of the processor. Certain restrictions must be placed on possible solutions for S due to
the limited interconnections available in arrays. These restrictions can be embodied in the
P and K matrices. The P matrix describes the interconnection primitives available within an
array, i.e., the vector differences between indices of connected processors. For example, a
square array with only north-south, east-west, nearest-neighbor connections would have the
following P matrix:

$$P = \begin{bmatrix} 0 & -1 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 & 1 \end{bmatrix}$$

where each column of P describes one interconnection primitive to be used to send or receive
data from a next-neighbor processing element. The primitive with all zero entries indicates
that a variable can also be stored in the processor. The utilization matrix K describes the
interconnections used by the transformed algorithm during execution. The relationship between
K, P, S, and D is

$$SD = PK \quad (2.2.3)$$

where the entries of K must satisfy the following constraint

$$\sum_{j} k_{ji} \le \pi d_i, \quad i = 1, \ldots, m \quad (2.2.4)$$

This last constraint requires that the time between the generation and use of a variable must
be greater than or equal to the number of interconnection primitives needed by the datum to
travel from the PE in which it was generated to the PE in which it will be used. In fact, the
inequality in (2.2.4) can be replaced by equality. If the number of primitives is less than the
time allowed for communication, the datum must be stored for the remaining time, thus using
the all zeros primitive. An additional constraint can be added that reflects the limited control
available within the simple PEs. This means that, in general, data must travel along the same
direction as it flows through the array. Thus, only one entry in each column of the K matrix
can be nonzero. These restrictions can be relaxed to reflect advances in VLSI technology.
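
A candidate space transformation can be checked against these two conditions with a few lines of code. The following is a sketch under assumptions rather than the procedure of the paper: the matrices in the example describe a hypothetical linear array with primitives "stay", "+1" and "-1", and the helper names are invented for illustration.

```python
def matmul(A, B):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def check_space_mapping(S, D, P, K, pi):
    """Check SD = PK and that each column sum of K does not exceed pi . d_i."""
    if matmul(S, D) != matmul(P, K):
        return False
    primitives_used = [sum(col) for col in zip(*K)]   # entries of K are counts
    time_available = matmul([list(pi)], D)[0]         # pi . d_i for each column
    return all(u <= t for u, t in zip(primitives_used, time_available))


D = [[1, 1], [-1, 1]]          # example dependence matrix
pi = (1, 0)                    # time transformation
S = [[0, 1]]                   # hypothetical space transformation (linear array)
P = [[0, 1, -1]]               # hypothetical primitives: stay, +1, -1
K = [[0, 0],                   # column i says which primitives dependence i uses
     [0, 1],
     [1, 0]]

print(check_space_mapping(S, D, P, K, pi))
```

Doubling S in this example would double the distance each datum travels while the available time stays at one step, so the second condition fails even though a matching K can still be written down.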

The design problem in the data dependency method can be formulated as follows: find a
suitable $\pi$, which then defines possible solutions for K (2.2.4). Examine the solution (or
solutions) for S corresponding to each K and determine which S requires the smallest number
of PEs. A procedure is available for finding the optimal $\pi$ in terms of the smallest execution
time [5]. There is no guarantee that a solution for S in the equation SD = PK exists for the
matrices K associated with the optimal $\pi$. If no solution for S can be found for the optimal $\pi$,
certain heuristics ('^j can be applied to find a suboptimal $\pi$ so that a space transformation S
exists. However, note that these results refer to a space of solutions where disp $\pi$ may be
equal to or larger than unity. This greatly complicates the optimization procedure [.i|
III - Equivalences between the Parameter and Data Dependency Methods

The two methods discussed in this paper each contain sets of equations which describe the
flow of data in a systolic array. The data dependency and parameter methods have,
respectively, the space equations (2.2.3) and the systolic processing equations (2.1.9-2.1.14). In
the following analysis, the relationships between the two sets of equations will be established.
Lemmas 1-3 provide equivalences between the different parameters of the two methods, while
Lemma 4 describes the form of the dependency matrices for algorithms considered in the data
dependency method. These lemmas are then applied in Theorem 1 to show that the space
equations and systolic processing equations are equivalent. The proofs are omitted here but
can be found in [12].

The first lemma gives expressions for the data distribution and velocity vectors of the
parameter method in terms of the transformations and indexing matrices of the data
dependency method.

Lemma  1 

Let S, $\pi$ be as defined previously in section 2.2, and let v be any of the variables x, y, z
as given for the parameter method. Also, let $C^v$ represent the indexing matrix for variable v.
Then the following relationships hold for the two-dimensional case:

(„.h'  fill 


±1 

0  =  v,  ,  s 


±1  -  V, 

0 


For the one-dimensional case, the following relationships apply


C"  ' 

±1 

C’ 

r 

0 

w 

0 

II 

(A 

IT 

±1 

The next lemma describes the relationship between the $\pi$ vector of the data dependency
method and the periods $t_i$, $t_j$, $t_k$ of the parameter method. The relationship will prove to be
remarkably simple.

Lemma 2


$$\pi = [\, t_i \quad t_j \quad t_k \,]$$


Thus, the periods of the parameter method are the elements of the $\pi$ matrix. The next
lemma relates the elements of the data dependency method's K matrix and the constants


I*.  IIS'S  11.  defined  in  equalton  (21. '.’01.  le  .  |t.|  j  =  k,  .  j 
j  (|j  I  x.j|  =  k,  .  Let  k,  be  the  single  nontero  entry  of  the  I'th  column  of  K 

Lemma 3


ly.fl  = 


The next lemma describes the form of the dependency matrices for the class of recurrences
considered in the parameter method.

Lemma 4

The dependency matrices for the class of recurrences considered in the parameter method
have the following structure:

Two-dimensional Recurrence:

$$D = \left[ \begin{array}{ccc|c} \pm 1 & 0 & 0 & \\ 0 & \pm 1 & 0 & d_1 \cdots d_r \\ 0 & 0 & \pm 1 & \end{array} \right]$$


One-dimensional Recurrence:


where $d_1, \ldots, d_r$ are dependency vectors which are a function of the recurrence, as is r, the total
number of these additional dependencies.


The following theorem shows that the equations used in both methods to describe the
operation of a systolic array are equivalent.


Theorem 1


The constraint equations (2.1.9 - 2.1.14) of the parameter method are equivalent to the
space equations, SD = PK, of the data dependency method.


IV - Optimization Procedures and Examples


Optimization Procedures

Optimization procedures for the parameter method were discussed previously in Section II.
By directly translating the parameters and constraints of this method into the corresponding
elements of the dependency method, we can devise a similar procedure which is applicable to
the recurrences considered in [9]. However, by using a slightly different approach, it is
possible to propose a related optimization procedure applicable to all cases for which
disp $\pi$ = 1 in the dependency method. It differs from that proposed for the parameter
method in that it checks all possible values of $K$ before considering longer execution times
(i.e., different $\pi$'s). The flowchart of Figure 2 describes the new optimization procedure.
In words, it starts by finding all transformations $\pi$ which minimize execution time. This is
relatively easy, since only the case disp $\pi$ = 1 is considered and execution time is therefore
a monotonic function of the entries of $\pi$. Hence, one can start with all entries of $\pi$ being
zero and progressively increase their absolute values, considering all possible combinations
of signs and magnitudes (while, of course, checking for the validity of each $\pi$). Possible
$\pi$'s which might result from further increases in the absolute value of the entries of a
particular $\pi$ for which execution time is larger than the known minimum need not be
considered, due to the monotonicity property mentioned above. Thus, the search space is
finite and, in fact, rather small for most cases. Once the set of $\pi$'s is known, it is
necessary to check if there exists a solution to the equation $SD = PK$ for at least one of the
possible values of $K$. If a solution is found, then the corresponding $\pi$ (as well as the
design determined by $\pi$ and $S$) is optimal with respect to execution time. Otherwise, a new
set of $\pi$'s must be found which increase execution time by the least amount, and the process
is repeated. The procedure always terminates, since, in the worst case, serial execution is
reached as a feasible solution.
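
The first step of this search can be sketched in C. This is a minimal illustration under stated assumptions, not the procedure of the flowchart: it scans a bounded box of candidate $\pi$ vectors exhaustively instead of growing them in order of increasing execution time, it omits the $SD = PK$ feasibility check, and it assumes a rectangular n1 x n2 index set for which, with disp $\pi$ = 1, the execution time is 1 + |π1|(n1 - 1) + |π2|(n2 - 1). The names `exec_time`, `is_valid`, `min_time_pi`, and `NDEP` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define NDEP 3 /* number of dependence vectors (columns of D); illustrative */

/* Assumed cost model for a rectangular n1 x n2 index set with disp(pi) = 1. */
static int exec_time(const int pi[2], int n1, int n2)
{
    return 1 + abs(pi[0]) * (n1 - 1) + abs(pi[1]) * (n2 - 1);
}

/* pi is valid when pi * d > 0 for every dependence vector d (column of D). */
static int is_valid(const int pi[2], const int D[2][NDEP])
{
    for (int c = 0; c < NDEP; c++) {
        if (pi[0] * D[0][c] + pi[1] * D[1][c] <= 0) {
            return 0;
        }
    }
    return 1;
}

/* Scans every pi with entries in [-bound, bound]. Returns the smallest
   execution time among the valid candidates (INT_MAX if there is none)
   and stores one minimizing pi in best. */
int min_time_pi(const int D[2][NDEP], int n1, int n2, int bound, int best[2])
{
    assert(bound >= 0 && n1 >= 1 && n2 >= 1);
    int best_t = INT_MAX;
    for (int a = -bound; a <= bound; a++) {
        for (int b = -bound; b <= bound; b++) {
            const int pi[2] = {a, b};
            if (!is_valid(pi, D)) {
                continue;
            }
            int t = exec_time(pi, n1, n2);
            if (t < best_t) {
                best_t = t;
                best[0] = a;
                best[1] = b;
            }
        }
    }
    return best_t;
}
```

For example, with the dependence vectors (1,0), (0,1), and (1,1) as the columns of D and a 6 x 4 index set, the scan selects $\pi$ = [1 1] with an execution time of 9 under the assumed cost model.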


A similar reasoning can be used to optimize measures combining area and execution time,
e.g., $AT$ or $AT^2$. Figure 3 illustrates such a procedure. It differs from that of Figure 2 in
that the search space is reduced to the set of $\pi$'s which result in execution time bounded
above by $T$, as given by (2.1.24). In this finite space, all valid values of $\pi$ and $S$ are
considered, and those which optimize the combined measure of area and time determine the
optimal solution. This is exactly the same approach used in the parameter method. The key
idea consists of limiting the search space by choosing bounds for $\pi$ and, thus, for the
execution time. Many different criteria can be used to choose the bounds. For example, in
[11], the same approach is used and $T$ is bounded by limiting the values of $\pi d$ (which can
be thought of as the number of buffers for the data associated with the dependence $d$) for
all dependencies $d$ in the matrix $D$.


Examples - Systolic Designs for the Convolution Algorithm

Convolution can be expressed as the following recurrence equation:

$$y_i^k = y_i^{k-1} + a_k\, x_{i+k-1}, \qquad 1 \le i \le n,\;\; 1 \le k \le m,\;\; x_j = 0 \text{ for } j > n \qquad (4.1)$$

Another possible description, with the order of access of the input terms reversed, is

$$y_i^k = y_i^{k-1} + a_{m-k+1}\, x_{i+m-k}, \qquad 1 \le i \le n,\;\; 1 \le k \le m,\;\; n > m \qquad (4.2)$$

Optimization procedure for obtaining a systolic array design with minimum
execution time (T) using the dependency method.


Design  using  the  parameter  method 

Two cases were considered for the convolution design problem [12], but only one is shown
here. Letting m = 4, n = 6, the first case will have period $t_i = 1$; the remaining period
values are illegible in the scanned source apart from a surviving value of -1.
Substituting these values into the systolic equations results in four equations in 5 unknowns.
For FIR-filtering applications, the a's are constants that can be loaded into or fixed in the
PEs before the computation begins. Thus the corresponding parameter of the a's is set to 0,
and the systolic equations then fix a further relation among the parameters [illegible in the
scanned source]. To achieve the fastest solution, the remaining free parameter can be set to
1 or -1. A possible solution is shown in Figure 3. Four PEs are required; m + n - 1 = 9 time
units are needed for computation and the preloading of values $x_1, \ldots, x_n$. The time to
completion is therefore 2m + n - 1 = 13 time units.

Design  using  the  dependency  method 

Now consider the problem in terms of the data dependency method. To do so, the
dependency matrices that are valid for recurrences (4.1) and (4.2) must be found. Pipelining
the variables in (4.1) and (4.2) yields the following forms for the allowable dependency
matrices, respectively (the pipelined recurrences themselves are missing from the scanned
source):

$$D_1 = \begin{bmatrix} \pm 1 & 0 & \pm 1 \\ 0 & 1 & \mp 1 \end{bmatrix} \quad \text{and} \quad D_2 = \begin{bmatrix} \pm 1 & 0 & \pm 1 \\ 0 & 1 & \pm 1 \end{bmatrix}$$

Figure 2 - Optimization procedure for obtaining a systolic array design with minimum
area x execution time (AT) using the dependency method.

Note that the only difference between $D_1$ and $D_2$ is that elements in the dependency vector
for x must have different signs for $D_1$ and the same signs for $D_2$. According to the
dependency method, the first $\pi$ to be considered is $\pi = [\,1 \;\; 1\,]$. Multiplying $\pi$ by $D_1$
and $D_2$ yields $\pi D_1 = [\,1 \;\; 1 \;\; 0\,]$ and $\pi D_2 = [\,1 \;\; 1 \;\; 2\,]$. The zero entry in
$\pi D_1$ indicates that the $\pi$ selected violates the dependencies of recurrence (4.1), as it
would be necessary to provide broadcasts. Thus, recurrence (4.2) is selected, and the space
transformation corresponding to the systolic array of Figure 3 is $S = [\,0 \;\; -1\,]$, which
yields $SD_2 = [\,0 \;\; -1 \;\; -1\,]$.
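
The products quoted above can be checked mechanically. The following C sketch is only an illustration: it assumes the positive sign choices $D_1$ = [1 0 1; 0 1 -1] and $D_2$ = [1 0 1; 0 1 1], and the function names `row_times_matrix` and `timing_is_valid` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes out[c] = v[0]*D[0][c] + v[1]*D[1][c] for a 2 x 3 dependence matrix. */
void row_times_matrix(const int v[2], const int D[2][3], int out[3])
{
    assert(v != NULL && D != NULL && out != NULL);
    for (size_t c = 0; c < 3; c++) {
        out[c] = v[0] * D[0][c] + v[1] * D[1][c];
    }
}

/* Returns 1 if every entry of pi*D is strictly positive, otherwise 0.
   A zero entry corresponds to the broadcast case discussed in the text. */
int timing_is_valid(const int pi[2], const int D[2][3])
{
    int t[3];
    row_times_matrix(pi, D, t);
    for (size_t c = 0; c < 3; c++) {
        if (t[c] <= 0) {
            return 0;
        }
    }
    return 1;
}
```

Under these assumptions, `timing_is_valid` rejects $\pi$ = [1 1] for $D_1$ and accepts it for $D_2$, and `row_times_matrix` applied to S = [0 -1] and $D_2$ gives [0 -1 -1].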


Verification that both methods yield the same design

Lemma 2 can be easily verified, i.e., $\pi = [\, t_i \;\; t_k \,] = [\, 1 \;\; 1 \,]$. To verify that the
space transformation $S$ corresponds to the same systolic array of Figure 3 and, thus, to the
velocities and data distributions of the corresponding solution in the parameter method,
Lemma 1 can be verified as follows

Figure 3 - Systolic array for convolution [9].

Systolic Design for the Deconvolution Algorithm

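Deconvolution is the inverse of FIR filtering and can be expressed [9] as the following
recurrence with a temporary variable:

[Recurrence illegible in the scanned source. The surviving index ranges are $1 \le i \le n$
for the first equation and $1 \le k \le m-1$, $1 \le i \le n$, with $x_j = 0$ for $j > n$, for
the second.]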


Design  Using  the  Parameter  Method 

The parameter method was applied to develop a systolic array which performs
deconvolution [9]. The array must perform division to obtain $x_i$, and the $x_i$'s are used in
the computation of the temporary variables. The division operation may take more time than
multiplication, and this fact should be considered in the design process. Assume the delay of
a division PE is $w$ and the delay of other PEs is 1. This yields the equation $|t_i| = w + 1$.
Analysis of the feedback condition of datum $x_i$ yields an additional systolic equation
[illegible in the scanned source].

These two equations must be included in the optimization. Letting $w = 2$, two parameters of
the a's equal to 0 and -1, and $t_i = -3$, a possible solution to the constraint equations
yields the values -3/2, 2/3, 2, -2/3, and -2 for the remaining parameters [parameter names
illegible in the scanned source]. Note that the velocities of data flow have been averaged
over three clock cycles. A systolic array corresponding to these parameters, with m = 4,
n = 5, is shown in Fig. 4(a).

Design Using the Dependency Method


The following dependency matrix can be derived from the recurrence form for
deconvolution:

[Dependency matrix illegible in the scanned source.]

Examining the recurrence, the critical dependence occurs between the generation of the
temporary variable and the use of $x_i$; this occurs with $k = m-1$, yielding a dependence
vector that is illegible in the scanned source. The broadcasting analysis is then applied to
pipeline variable x [result illegible in the scanned source]. Since division requires $w$ time
units, the timing constraint on the critical dependence equals $w + 1$ [left-hand side
illegible in the scanned source] and, with $w = 2$, $\pi_1 = -3$. In the proposed optimization
procedure, $\pi_2 = 1$, the smallest possible value; possible $S$ values include
$S = [\,0 \;\; 1\,]$, which optimizes space. This means that the generation of the temporary
variable and the usage of $x_i$ occur in the same PE. This array is optimal with respect to
completion time $T$ and #PE x $T^2$. It could be developed using the parameter method, where,
from Lemma 2, $t_i = -3$ and $t_k = 1$. A systolic array which conforms to this $\pi$ and $S$,
for particular values of m and n [illegible in the scanned source], is shown in Fig. 4(b); the
completion time of this array is $|\pi_1|(n-1) + |\pi_2|(m-1+w)$. The deconvolution array of
Fig. 4(a) has $\pi = [\,-3 \;\; 3/2\,]$ and $S = [\,0 \;\; 1\,]$. The same process of verification used
for convolution can be applied to show equivalence between the deconvolution arrays designed
using the different methods.


New systolic equations


Theorem 1 showed that the systolic equations of the parameter method are equivalent to the
space equations, $SD = PK$, of the data dependency method. Systolic equations for the
one-dimensional case are equivalent to $Sd_y = Pk_y$ and $Sd_x = Pk_x$, respectively. The
subscripts on the vectors $k$ and $d$ indicate which variable is associated with a particular
vector. Thus, the $Sd_a = Pk_a$ space equation is not contained within the systolic equations
of the parameter method for the one-dimensional case.

The systolic equations for the a dependency will take the general form

[Equations (4.13) and (4.14) illegible in the scanned source.]

where $f_x$ and $f_y$ are linear functions of $t_i$ and $t_k$. To determine the functions $f_x$
and $f_y$, the equivalences defined in Lemmas 1 and 2 are applied to (4.13)-(4.14), resulting in


[Matrix equations illegible in the scanned source.]



M. T. O'Keefe and J. A. B. Fortes


Factoring out the $S$ and simplifying the above equations yields:

[Simplified matrix equations illegible in the scanned source.]


Selecting recurrence (4.2) yields $C^x = [\,1 \;\; -1\,]$, $C^a = [\,0 \;\; -1\,]$, and
$C^y = [\,1 \;\; 0\,]$. Using these in the equations above gives $f_x$ and $f_y$ as expressions in
$t_i$ and $t_k$ [exact expressions illegible in the scanned source], and equations
(4.13)-(4.14) become

[Resulting equations illegible in the scanned source.]

The above analysis was applied to the recurrence of equation (4.2). A similar analysis of the
recurrence expressed in equation (4.1) results in expressions of the form $(t_k - t_i)$ for
these functions in equations (4.13) and (4.14) [exact expressions partly illegible in the
scanned source]. This new set of systolic equations, derived from the data dependency method
through the equivalences described in Section III, can be added to the set of systolic
equations. From this new set, only four equations are needed to provide equivalent solutions
to those derived from the original four equations.

ABSTRACT

RAB, a Reconfiguration Algorithm for Bit-level
code, is a large program which systematically maps a
class of numerical algorithms into bit-level processor
arrays. This paper explains the purpose of RAB,
outlines its overall organization, presents the underlying
ideas and techniques of the main components of RAB,
and discusses some implementation details. The input
to RAB consists of C programs with word-level
computations. Each arithmetic operation in these
computations is first replaced by several bitwise
operations (i.e., a bit-level expansion) which implement
that operation. Dependencies are then detected in the
bit-level code and represented as a dependence matrix
which is used in the synthesis phase of RAB to generate
an algorithm transformation. In the final mapping
phase of RAB, each bit-level operation in the
transformed algorithm is replaced by a corresponding
microprogram (i.e., a microcode expansion). This
microcode is also optimized in this phase to produce the
output of RAB, an algorithm executable on the
processor array. Currently, prototype processor arrays
composed of several NCR Geometric Arithmetic Parallel
Processor (GAPP) chips are the targets for the output of
RAB.


I.  INTRODUCTION 

RAB is a program which maps a class of numerical
algorithms programmed in C into bit-level processor
arrays. It can be used to derive a full design
specification for an algorithmically defined processor
array as well as to identify full (partial) mappings of an
algorithm into an existing processor array of fixed
(variable) size. This paper explains the purpose of RAB,
outlines its overall organization, presents the underlying
ideas and techniques of the main components of RAB,
and discusses some implementation details. In order to
illustrate the concepts and operation of RAB, we show
how two algorithms for convolution are mapped into a
variable size processor array composed of NCR GAPP
chips (i.e., each chip is a (12 x 6) processor array
[DaTh84]).

This work was supported in part by the National Science
[text missing in the scanned source] Innovative Science and
Technology Office of the Strategic Defense Initiative Organization
and was administered through the Office of Naval Research under
contract no. 00014-85-k-0588.

Processor arrays generally consist of a collection of
processing elements (PEs) with a regular interconnection
scheme. Systolic arrays, as characterized by Kung
[Kung82], are a special case of processor arrays in which
data flows from one PE to another in a regular and
synchronous fashion. Generally, a systolic array is easy
to implement and extend because of its regularity and
modularity. Due to simplicity of local processor design
and advances in VLSI technology, relatively general
purpose bit-level arrays are becoming common (i.e.,
GAPP [DaTh84], MPP [Batc80], DAP [Redd79], CLIP
[DuWa75] and others). As compared to word-level
arrays, bit-level arrays require simple processing
elements (e.g., processing elements composed of a full
adder, some simple logic, and a number of registers) and
provide high throughput rates (i.e., bit rates). These
characteristics also make bit-level processor arrays very
attractive for special-purpose applications, e.g., digital
signal processing ([McMc82], [Mcetal84], [Mcetal85]).

Despite being ideally suited for various applications
and VLSI implementation, processor arrays can be
difficult to program (in the case of an existing general
purpose architecture) or design (in the case of a special
purpose architecture). This is a particularly acute
problem for bit-level systolic arrays where the goal is to
implement high level computations (e.g., matrix
computations, convolution, etc.) using bitwise operations.
In order to solve this problem, it is desirable to develop
methodologies and tools which enable the systematic
mapping of algorithms into processor arrays. In the
past, several research efforts have been pursued in this
direction and a survey can be found in [Foetal85]. Many
of those methodologies, which were intended for
word-level processor arrays, are applicable to bit-level arrays.
However, besides some of the limitations that still
characterize those methodologies, systematic bit-level
designs present additional problems. RAB represents an
attempt to develop an automated tool for the design
and programming of bit-level arrays and to understand
and solve the open questions and problems involved in
this process.

In practice, potential users of processor arrays are
given an algorithm and must devise a means for its
execution using one of the following options: (1) to use
an existing processor array, (2) to design a special
purpose processor array, or (3) to design an array that
uses a number of existing smaller processor array
modules as the basic components. Option (1) requires
mapping of the algorithm into an existing array taking
into consideration size limitations, fixed interconnection
schemes, and predesigned processing elements. In this
option, which we refer to as full mapping, the
programming decisions are totally subordinated to the
characteristics of the array. Option (2) allows the user
to design the hardware taking into consideration only
the characteristics of the algorithm and perhaps some
rather general VLSI design constraints (i.e., planarity,
limited pinout, etc.). This option is referred to as full
design. Option (3) is a compromise between full
mapping and full design, where the designer can decide
the overall organization (i.e., shape, size, interfaces) of
the array, but uses given basic blocks which are
themselves fully defined "small" processor arrays. We
refer to this option as partial mapping/design.

The input to RAB consists of C programs which
implement word-level algorithms. In section II of this
paper we characterize the class of algorithms for which
RAB is intended, present the algorithm model, and
describe the representation of dependencies in an
algorithm. RAB first expands the computations in the
input program into bit-level operations as shown in
figure 1. This expansion phase, which is described in
section III, replaces word-level computations with a
bit-level implementation of the arithmetic operations. This
phase is followed by data dependence/broadcast analysis
which uses techniques discussed in section IV. The
results of this analysis can be used to generate an
algorithm transformation which yields a full design of an
algorithmically-defined array or full (partial) mapping
for a fixed (variable) size array corresponding to the
third level of modules in figure 1. In section V we
present the methodology for the generation of a partial
mapping and discuss how a full mapping or full design
can be obtained. The last two modules in figure 1,
microcode expansion and microcode optimization,
comprise the mapping phase which is discussed in
section VI. In section VII we review the status of the
implementation effort and present some concluding
remarks about the project.


II. ALGORITHM MODEL AND REPRESENTATION

Algorithm Representation

RAB accepts as input a program which uses a
subset of C constructs. Since algorithms that run
efficiently on a processor array are likely to have a
repetitive and regular structure, the input to RAB
consists of programs which typically contain loops. For
this reason, RAB is capable of efficiently analyzing
loop-like programs with static behavior. In addition to
the fact that pointers and function calls cannot be used,
the structure of the loop-like programs accepted by RAB
must exhibit the following characteristics:

1. the lower and upper bounds of the outermost
   loop must be integer constants.

2. the bounds of the nested loops must be linear
   expressions of the outer loop indices or integer
   constants.

3. the step of each loop must be one.

4. no two loops can have the same nesting level.

5. arrays of any dimensions are allowed; the
   range of each dimension must be an integer
   constant.

6. the boolean expression of a conditional
   statement must be a linear expression of the
   outer loop indices.

7. all subscript expressions used when referencing
   elements of arrays must be linear expressions
   of the outer loop indices.

Example 2.1

The following convolution algorithm is an example
of a program which satisfies the criteria of the
algorithm representation.

    for(j1 = 1; j1 <= N1; j1++){
        for(j2 = 1; j2 <= N2; j2++){
            y[j1] = y[j1] + w[j2] * x[j1+j2-1];
        }
    }

where w[j2] is the sequence of weights, x[j1+j2-1] is
the sequence of inputs and y[j1] is the result
sequence.

End of example.

Other programs such as matrix-matrix
multiplication, matrix-vector multiplication, and FIR
and IIR filtering satisfy the constraints of our algorithm
representation. A broader list of suitable programs can
be found in the concluding remarks of [Kung82]. Many
programs which fall outside of this class can be
transformed to satisfy the above constraints, using such
techniques as normalization or loop fusion [Wolf82].
Conceivably, such techniques could be easily
implemented in a preprocessing step for RAB. However,
it is assumed that input programs have been normalized
and loop fusion is not needed. The next subsection
presents the formal definitions of dependencies.

Modeling Dependencies in Algorithms

The parallel execution of independent operations
requires knowledge about the existing dependencies in
an algorithm in order to preserve its semantics. There
are two types of dependencies that can occur in an
algorithm: machine dependencies and algorithm
dependencies [Kuetal81]. Machine dependencies result
from the limitations of the particular architecture used
for execution of the algorithm; algorithm dependencies
result from the structure of the algorithm. The first
category of dependence, machine dependence, also called
resource dependence, is defined as follows.

Figure 1. Flow diagram of RAB.

Definition 2.1 (machine dependence)

Statement $S_i$, denoted as the head of the
dependence, is machine dependent on statement $S_j$,
denoted as the tail of the dependence, if and only if

1. statement $S_j$ precedes statement $S_i$ and

2. $res(S_i) \cap res(S_j) \ne \emptyset$

where $res(S_i)$ denotes the set of resources needed to
execute statement $S_i$.

Machine dependencies can be divided into two
categories: explicit machine dependence and implicit
machine dependence. Explicit machine dependencies
result from the apparent limitations of the architecture.
For example, statement $S_i$ is explicitly machine
dependent on statement $S_j$ if both statements require a
write to two different memory (RAM) locations and the
given architecture only has one RAM port. Implicit
resource dependencies are inherent in the semantics of
the instructions. For example, in a GAPP array, the
arithmetic and logic unit (ALU) of each PE always
executes a "full add" operation every clock cycle,
regardless of the instruction being executed. As a
consequence, the architecture of each PE exhibits
implicit resource dependencies with the use of the
calculated variables sm, bw, and cy (which denote sum,
borrow, and carry respectively). Thus, if a statement
explicitly uses a calculated variable, it will always
depend on the previous statement.

The second category of dependencies, algorithm
dependence, consists of the three classical dependencies:
output dependence, data dependence, and
anti-dependence. These dependencies are defined in the
following definition.

Definition 2.2 (algorithm dependence)

Statement $S_i$, denoted as the head of the
dependence, is algorithm dependent on statement $S_j$,
denoted as the tail of the dependence, if and only if

1. statement $S_j$ precedes statement $S_i$ and

2. one of the following conditions is satisfied:

i. $out(S_j) \cap out(S_i) \ne \emptyset$

ii. $out(S_j) \cap in(S_i) \ne \emptyset$

iii. $in(S_j) \cap out(S_i) \ne \emptyset$

where $out(S_i)$ denotes the set of output variables of
statement $S_i$, and $in(S_i)$, the set of input variables.
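
The three conditions in the definition of algorithm dependence above can be illustrated with a small C sketch. It is only an illustration and not part of RAB: variable sets are encoded as bitmasks, and the names `Stmt`, `var_bit`, and `algorithm_dependence` are hypothetical.

```c
#include <assert.h>

/* A statement is summarized by its input and output variable sets,
   encoded as bitmasks: bit v is set when variable v belongs to the set. */
typedef struct {
    unsigned in;  /* in(S): variables read by the statement */
    unsigned out; /* out(S): variables written by the statement */
} Stmt;

enum { DEP_OUTPUT = 1, DEP_DATA = 2, DEP_ANTI = 4 };

/* Returns the bitmask holding only variable number v (0 <= v < 32). */
unsigned var_bit(int v)
{
    assert(v >= 0 && v < 32);
    return 1u << v;
}

/* tail is assumed to precede head in program order.
   Returns an OR of DEP_* flags, or 0 if head does not depend on tail. */
int algorithm_dependence(Stmt tail, Stmt head)
{
    int kinds = 0;
    if (tail.out & head.out) kinds |= DEP_OUTPUT; /* condition i   */
    if (tail.out & head.in)  kinds |= DEP_DATA;   /* condition ii  */
    if (tail.in & head.out)  kinds |= DEP_ANTI;   /* condition iii */
    return kinds;
}
```

For instance, if the tail computes a from b and the head computes c from a, only the data dependence flag is returned.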

It is assumed that the reader is knowledgeable
about algorithm dependencies, for which an example
would be redundant. In RAB, only algorithm data
dependencies are detected in the dependence analysis for
use in the generation of the algorithm transformation.
The reason for the detection of only data dependencies
will become evident in the discussion on the algorithm
transformations. However, machine dependencies and all
three algorithm dependencies are detected in the
microcode optimization module to be discussed later in
this paper.

Distance or dependence vectors provide a
particularly convenient way of representing algorithm
dependencies between statements referencing arrays.
We define the distance vector as the vector difference
between the index of the computation where a variable
is used and the index of the computation where the
same variable is generated. These vectors can be placed
in a dependence matrix which is representative of the
algorithm dependencies in a program. This matrix and
other algorithm parameters are essential features that
are represented in the algorithm model defined below.

Definition 2.3 (algorithm model)

An algorithm is a 5-tuple, $\langle J^n, C, D, I_v, O_v \rangle$, where
$J^n \subset Z^n$ is the index set ($Z$ represents the set of all
integers), $C$ is the set of computations, $D$ is the set of
dependencies represented by distance vectors, $I_v$ is the
set of input variables for the algorithm, and $O_v$ is the
set of output variables for the algorithm.

An example of the algorithm model is given in section
IV.

The dependencies represented in the dependence
matrix, $D$, must be satisfied by the execution ordering of
an algorithm defined below.

Definition 2.4 (execution ordering)

A partial ordering is an execution ordering if all
distance vectors are positive in the sense of that
ordering.

In other words, the execution ordering of an
algorithm restricts the generation of a variable to
always precede the usage. RAB replaces an execution
ordering which is total by an execution ordering which is
partial. Thus the original distance vectors represented
in the matrix $D$ must be positive in the sense of the
lexicographical ordering.
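
Distance vectors and the lexicographic test on them can be sketched in C as follows. This is a minimal illustration, not RAB code, and the function names `distance_vector` and `lex_positive` are hypothetical. In the convolution loop of Example 2.1, the value y[j1] generated at iteration (j1, j2) is used again at iteration (j1, j2 + 1), which gives the distance vector (0, 1).

```c
#include <assert.h>
#include <stddef.h>

/* d = use - gen, componentwise, for two n-dimensional index points:
   the point where a variable is used and the point where it is generated. */
void distance_vector(const int *use, const int *gen, int *d, size_t n)
{
    assert(use != NULL && gen != NULL && d != NULL);
    for (size_t i = 0; i < n; i++) {
        d[i] = use[i] - gen[i];
    }
}

/* Returns 1 if d is positive in the lexicographical sense, i.e. its first
   nonzero component is positive; returns 0 otherwise (including d = 0). */
int lex_positive(const int *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (d[i] > 0) return 1;
        if (d[i] < 0) return 0;
    }
    return 0;
}
```

A sequential loop nest executes its iterations in lexicographical order, so every distance vector extracted from such a program passes `lex_positive`.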

Now that we have presented our algorithm model
and representation, we will describe the various modules
shown in figure 1. The next section discusses the
bit-level expansion of the word-level computations.


III. BIT-LEVEL EXPANSION

The first phase of RAB systematically replaces the
word-level computations with bit-level implementations
of the arithmetic operations. These bit-level
implementations are hereafter referred to as expansions.
The actual expansion for a given arithmetic operation is
not unique. For example, there are several expansions
for the multiplication operation, e.g., Booth's algorithm
[Boot51] or the shift-add algorithm [Hwan79]. The
bit-level arithmetic expansions used with RAB were chosen
for their simplicity, since RAB is relatively new and in the
initial stages of testing. Conceivably other expansions
can be used with RAB to investigate the optimality of
different bit-level algorithms. RAB currently implements
the bit-level expansions for addition, multiplication,
division, subtraction, and all possible pairwise
combinations of these operations. Actually, RAB
provides two types of expansions for each operation or
operation pair. The first type is used for running the
expanded algorithm as a conventional C program. This
provides the user with a means for gathering test data.
In these expansions, statements are included which
explicitly convert the initial data values to their bit-level
representations. These conversion statements are not
contained in the second type of expansion which is used
in the dependence analysis module of RAB. In order to
facilitate the analysis, the second type of expansion
eliminates output and anti-dependencies using such
techniques as renaming and expansion ([Gaetal81],
[Kuetal81]). Two different bit-level implementations of
the convolution algorithm (given in example 2.1) are
shown in figure 2. These two algorithms result from the


for(j1 = 1; j1 <= N1; j1++){
  for(j2 = 1; j2 <= N2; j2++){
    for(j3 = 1; j3 <= N3; j3++){
      for(j4 = 1; j4 <= N4; j4++){
        if(j4 == 1){
          cy2[j1][j2][j3][0] = (cy2[j1][j2][j3-1][0] & sum1[j1][j2][j3-1]) |
                               (cy2[j1][j2][j3-1][0] & sum2[j1][j2-1][j3-1]) |
                               (sum1[j1][j2][j3-1] & sum2[j1][j2-1][j3-1]);
          sum2[j1][j2][j3-1] = cy2[j1][j2][j3-1][0] ^ sum1[j1][j2][j3-1] ^
                               sum2[j1][j2-1][j3-1];
        }
        else
        if((j4 > 1) && (j3 < N3)){
          cy1[j1][j2][j3][j4-1] = (sum1[j1][j2][j3+j4-2] & (w[j2][j3] & x[j1+j2-1][j4-1])) |
                                  (sum1[j1][j2][j3+j4-2] & cy1[j1][j2][j3][j4-2]) |
                                  ((w[j2][j3] & x[j1+j2-1][j4-1]) & cy1[j1][j2][j3][j4-2]);
          sum1[j1][j2][j3+j4-2] = sum1[j1][j2][j3+j4-2] ^ cy1[j1][j2][j3][j4-2] ^
                                  (w[j2][j3] & x[j1+j2-1][j4-1]);
        }
        else
        if((j4 > 1) && (j3 == N3)){
          cy2[j1][j2][j3][j4-1] = (cy2[j1][j2][j3][j4-2] & sum1[j1][j2][j3+j4-2]) |
                                  (cy2[j1][j2][j3][j4-2] & sum2[j1][j2-1][j3+j4-2]) |
                                  (sum1[j1][j2][j3+j4-2] & sum2[j1][j2-1][j3+j4-2]);
          sum2[j1][j2][j3+j4-2] = cy2[j1][j2][j3][j4-2] ^ sum1[j1][j2][j3+j4-2] ^
                                  sum2[j1][j2-1][j3+j4-2];
        }
      }
    }
  }
}


Figure 2a. A bit-level expansion of the convolution algorithm.


use  of  two  distinct  expansions  of  the  operation  pair  (+, 

The  algorithm  presented  in  figure  2b  is  currently 
used  in  the  expansion  phase  of  RAB.  However,  the  user 
is  not  required  to  input  an  algorithm  into  the  expansion 
module;  facilities  are  provided  whereby  a  user  may 
bypass  this  module  and  input  a  bit-level  algorithm 
different  from  the  one  generated  in  the  expansion 
module.  This  is  the  case  with  the  algorithm  presented 
in  figure  2a.  Both  algorithms  correspond  to  the  second 
type  of  expansion  mentioned  above  and  can  serve  as 
inputs  to  the  dependence  analysis  module  discussed  in 
the  following  section. 
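
As a small illustration of what a bit-level expansion of a single operation looks like, the sketch below expands integer addition into one-bit full-add operations, including the explicit conversion of the initial data values to bits that the first type of expansion performs. It is not RAB output; the function names (`full_add`, `to_bits`, `from_bits`, `bit_level_add`) and the least-significant-bit-first ordering are assumptions made for this example.

```python
def full_add(a, b, c):
    """One-bit full adder: returns (sum_bit, carry_bit)."""
    s = a ^ b ^ c
    cy = (a & b) | (a & c) | (b & c)
    return s, cy


def to_bits(value, width):
    """Convert a non-negative integer to a list of bits, least significant first."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << k for k, bit in enumerate(bits))


def bit_level_add(x, y, width):
    """Add two width-bit integers using only full-add operations."""
    xb, yb = to_bits(x, width), to_bits(y, width)
    out = []
    carry = 0
    for k in range(width):
        s, carry = full_add(xb[k], yb[k], carry)
        out.append(s)
    out.append(carry)  # final carry becomes the most significant bit
    return from_bits(out)
```

The same sum and carry expressions (exclusive-or of three bits, and the majority of three bits) are the ones that appear in the statements of figure 2a.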

IV. DEPENDENCE/BROADCAST ANALYSIS
The  dependence  analysis  module  detects 
dependencies  between  statements  referencing  arrays.  In 
order  for  a  dependence  to  exist  between  two  statements 
referencing  arrays,  the  following  conditions  must  be 
satisfied: 

1.  the  array  references  in  the  two  statements 
must  have  the  same  name. 


2.  given  that  condition  1  is  satisfied,  the 
functions  which  specify  the  subscripts  of  the 
array  references  must  have  the  same  value 
for  some  index  value(s). 

3.  the  index  value(s)  for  which  condition  2  is 
satisfied  must  belong  to  the  iteration  space. 

This module is invoked when the parser detects that
condition 1 has been satisfied. Kuhn's Dependence Arc
Set Analysis (DASA) technique [Kuhn80] is used with
RAB to verify conditions 2 and 3.
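
Conditions 2 and 3 can be illustrated with a brute-force check over a small iteration space. This is not the DASA technique, which works symbolically on convex sets; it is only an enumeration sketch under stated assumptions: condition 1 (same array name) is taken as given, each array element is generated at most once, and the names `iteration_space` and `distance_vectors` are hypothetical.

```python
from itertools import product


def iteration_space(bounds):
    """All index points of a loop nest; bounds is a list of (lower, upper) pairs."""
    return list(product(*[range(lo, hi + 1) for lo, hi in bounds]))


def distance_vectors(write_sub, read_sub, bounds):
    """Enumerate head - tail differences for which a dependence exists.

    write_sub / read_sub map an index point to the tuple of array subscripts
    generated / used at that point.
    """
    points = iteration_space(bounds)
    found = set()
    for tail in points:        # tail: iteration that generates the element
        for head in points:    # head: iteration that uses the element
            # Condition 2: subscripts coincide. Condition 3: both points lie
            # in the iteration space (guaranteed by the enumeration).
            # tail < head (lexicographic) keeps only executions in order.
            if tail < head and write_sub(tail) == read_sub(head):
                found.add(tuple(h - t for h, t in zip(head, tail)))
    return found
```

For a statement that generates A[i][j] and uses A[i-1][j] with 1 <= i, j <= 3, the call `distance_vectors(lambda p: (p[0], p[1]), lambda p: (p[0] - 1, p[1]), [(1, 3), (1, 3)])` yields the single distance vector (1, 0). If the outer loop has only one iteration, the subscripts never coincide inside the iteration space and no dependence is reported.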

Dependence  Detection 

DASA utilizes five relations represented as convex
sets to gather information about the possible
dependencies and to determine whether conditions 2 and
3 are satisfied. Dependencies are considered in relation
to the Cartesian product of the loop indices and the
nesting level of the statements involved in the possible
dependencies. Two of the five relations, T and H, define
the control structure of the loops surrounding the tail
statement (i.e., the point where the data is generated)
and head statement (i.e., the point where the data is
used) of the possible dependence, respectively. Two


for(j1 = 1; j1 <= N1; j1++){
for(j2 = 1; j2 <= N2; j2++){
for(j3 = 1; j3 <= N3; j3++){
for(j4 = 1; j4 <= N4; j4++){

«y|jil|j2l|j3)lj«i  =  (wljjlijsl  &  suin|j,l(j34  j<-ll)  1  (cy|jilljjl|j3l|j<-l]  & 

(wli2llj.1l  !  siiinljillja  lj^-ll)) 

surnljillia  ^  (  wljjljjj)  '  3mM|j,|(j3  ( ‘  <'yliiili2|!iiMJ<-l) 

xlii+j2-llli<!)  i  (sum|j,j|j3  t  &  'xIj, 


Figure 2b. The bit-level expansion of the convolution algorithm used
in the expansion phase of RAB.


other relations, S_g and S_u, respectively define the
indexing functions of the generated and used arrays
referenced in the tail and head statements. The fifth
relation, F_th, represents the forward relation, which is
used to test different conditions of the loop indices for
the existence of a dependence. These relations are
represented as convex sets in a matrix format that is
easily implemented and manipulated in software. If a
solution space results from the convex analysis of the
intersection of the relations T, H, F_th, and S_g composed
with S_u^{-1}, then a dependence exists for the conditions
defined by the forward relation, F_th. The inequalities of
the solution space of a dependence are then ordered to
form an upper bound matrix, U, and a lower bound
matrix, L, to be used in generating distance vectors.
Further details about DASA can be found in [Kuhn80]
and [Tayl86].

Broadcast  Analysis 

An  analysis  scheme  similar  to  the  one  used  with 
DASA  is  used  to  discover  when  a  broadcast  exists.  Since 
only  one  relation  is  required  for  the  broadcast  analysis 
we  will  elaborate  on  this  concept  to  introduce  the  reader 
to  the  representation  of  relations  as  convex  sets  and  the 
analysis  scheme  also  used  with  DASA  for  dependence 
detection. 

A data item requires a broadcast if and only if the
datum is needed to simultaneously execute two or more
computations in distinct processors. In [Fort84],
sufficient and necessary conditions for a broadcast are
provided in relation to an array index function,
F(j) = Cj + c, where C is the indexing matrix and c
is the index displacement. In order to remain consistent
with the representation of the relations involved in
DASA, we represent the array indexing function, F(j), by
the subscript relation, S_s, which is defined below.

Definition 4.1 (subscript relation)

Let the subscript relation of the q-dimensional array
A be represented by the relation

$$ S_s = \left\{ \begin{bmatrix} S_s' & -I \end{bmatrix} \begin{bmatrix} \bar{j} \\ \bar{q} \end{bmatrix} = \bar{\sigma}_s \right\} $$

where $\bar{q}$ represents the array subscripts, $S_s' \in Z^{q \times n}$ is
called the subscript matrix, I is the identity matrix,
$\bar{\sigma}_s \in Z^{q \times 1}$ is called the constant vector, and n is the
dimension of the iteration space.

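The subscript matrix, S_s', is equivalent to the
indexing matrix, C, and the constant vector, σ_s, is
equivalent to the index displacement, c. For example,
the representation of the subscript relation for the input
variable x[j1+j2-1][j4-1] is given by the following:

$$ \begin{bmatrix} 1 & 1 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} \bar{j} \\ \bar{q} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$

where $S_s' = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$ and $\bar{\sigma}_s = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$.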

According to Theorem 3.1 in [Fort84], if
rank(S_s') = n-1 (or rank(C) = n-1), then broadcasting can be
eliminated by including the distance vectors for the
input variables in the dependence matrix used to
generate an algorithm transformation. These vectors for
the input variables are hereafter referred to as buffering
vectors. The buffering vector is defined as the vector
difference between the two points of the iteration space
using the same variable. This vector can be generated
in the following manner.

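Two computations indexed by j' and j'' use the
same variable if and only if the values of the array
subscripts are the same. This condition is represented
by the equation $S_s' \bar{j}' - \bar{\sigma}_s = S_s' \bar{j}'' - \bar{\sigma}_s$, or

$$ S_s' (\bar{j}' - \bar{j}'') = 0 \qquad (4.1) $$

where j' - j'' = d is the buffering vector. Equation 4.1
can be represented by the following convex set:

$$ \begin{bmatrix} S_s' & -S_s' \end{bmatrix} \begin{bmatrix} \bar{j}' \\ \bar{j}'' \end{bmatrix} = 0 \qquad (4.2) $$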

The intersection of the five relations used in DASA is
represented in a similar manner. The solution to the
convex set in (4.2) is found using an analysis procedure
similar to the one used with DASA. First the variables
are eliminated using a reduction procedure. If a
consistent solution results from the elimination step, the
variables are projected onto the space Z^n x Z^n. A
detailed description of the reduction and projection
procedures can be found in [Kuhn80] and [Tayl86]. In
the projection step the inequalities defining the solution
space are ordered to form the L and U matrices
mentioned in the previous subsection. The results of the
broadcast analysis for the input variable,
x[j1+j2-1][j4-1], are given below.


The resultant L matrix is shown as the coefficient
matrix in the following convex set, which specifies the
lower bounds of the solution space:


    1  1  0  0 -1 -1  0  0
    0  0  0  1  0  0  0 -1


the intersection of the control structure with the
relations S_g, S_u, and F_th, along with the results of the
broadcast analysis, are combined to form the following
dependence matrix (for the algorithm given in figure 2a).


        0  0  0  0  1  1
    D = 0  0  1  0  0 -1
        1  1  0  0  0  0
        0 -1  0  1  0  0


Similarly, for the upper bounds of the solution
space, the U matrix is as shown in the following convex
set:


    -1 -1  0  0  1  1  0  0
     0  0  0 -1  0  0  0  1


Similar matrices result from the dependence analysis
(however, typically, L ≠ -U). The generation of the
distance vectors for the input variables and the
generated variables involved in a dependence is
discussed in the following subsection.

Distance  Vector  Generation 

The  distance  vectors  can  be  extracted  from  the  L 
and  U  matrices  by  inspection  and  enumeration.  The 
enumeration  step  is  only  necessary  when  the  elements  of 
the  distance  vector  are  functions  of  the  outer  loop 
indices  instead  of  constants.  This  step  consists  of 
substituting  each  point  in  the  solution  space  of  the 
dependence  (defined  by  the  L  and  U  matrices)  into  the 
given  functions  and  keeping  only  the  distance  vectors 
with  integer  elements  (fractional  entries  cannot  result 
from  the  difference  of  integer  vectors). 

The buffering vectors for the input variables are
extracted in a similar manner. The L and U matrices
resulting from the broadcast analysis of the input
variable x[j1+j2-1][j4-1] represent the equations
(j1' - j1'') + (j2' - j2'') = 0 and j4' - j4'' = 0. The
number of buffering vectors resulting from the broadcast
analysis is equal to n - rank(S_s') (which for our example is
2, since n = 4 and rank(S_s') = 2), corresponding to the
number of free variables. For the equations above, the
buffering vectors are the transposes of the vectors [1 -1 0
0] and [0 0 1 0], corresponding to the cases when
(j1' - j1'') = 1, (j3' - j3'') = 0 and (j1' - j1'') = 0,
(j3' - j3'') = 1, respectively. These vectors are not unique;
the value -1 may be used for each of the free variables.
However, this is not the case with the dependence
analysis. The distance vectors resulting from DASA are
unique since the control structure of the tail and head
statements is represented in the analysis.
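
The count and the two buffering vectors quoted for x[j1+j2-1][j4-1] can be checked numerically. The sketch below is only a verification aid, not the reduction and projection procedure used by RAB: it computes the rank of the subscript matrix by Gaussian elimination over the rationals and confirms that each quoted vector d satisfies S'd = 0. The names `rank`, `mat_vec`, `S_PRIME`, and `BUFFERING` are assumptions introduced here.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of an integer matrix via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def mat_vec(matrix, vec):
    """Matrix-vector product with plain Python lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


# Subscript matrix of x[j1+j2-1][j4-1] over the index vector (j1, j2, j3, j4).
S_PRIME = [[1, 1, 0, 0],
           [0, 0, 0, 1]]

# Buffering vectors quoted in the text.
BUFFERING = [(1, -1, 0, 0), (0, 0, 1, 0)]

num_buffering = len(S_PRIME[0]) - rank(S_PRIME)  # n - rank(S')
```

Here `num_buffering` evaluates to 2, and `mat_vec(S_PRIME, d)` is the zero vector for both entries of `BUFFERING`, in agreement with equation 4.1.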

The relations defining the control structure of a
dependence in DASA correspond to a subset of the
iteration space or a subset of J^n. For the algorithm
given in figure 2a, the iteration space is the set of the
points (j1, j2, j3, j4) ∈ J^n for 1 ≤ j_i ≤ N_i. The results of


Column 1 of the dependence matrix corresponds to the
used variables cy2[j1][j2][j3-1][0] and the input variables
x[j1+j2-1][j4-1]; column 2 corresponds to the used
variables sum1[j1][j2][j3+j4-2]; column 3 corresponds to
the used variables sum2[j1][j2-1][j3+j4-2]; column 4
corresponds to the used variables cy2[j1][j2][j3][j4-1] and
cy1[j1][j2][j3][j4-2] and the input variable w[j2][j3]; and
the last two columns correspond to the input variables
w[j2][j3] and x[j1+j2-1][j4-1]. The set of computations,
C, used in the algorithm in figure 2a corresponds to full
add operations. The set of output variables for this
algorithm consists of the results of summing the
products, which are stored in the array sum2. The set of
input variables consists of the weights, inputs, and the
sums and carries which were never generated.

The  algorithm  model  parameters  for  the  algorithm 
given  in  figure  2b  are  similar  to  the  ones  presented 
above  with  the  exception  of  the  dependence  matrix,  D, 
given  below. 


    0  0   0  1  1  0

    0  0   1  0 -1  0

    0  1 -jj  0  0  1

    1 -1  jj  0  0  0


l.  -.Ns  given  N3  <  N4. 


Column 1 of this dependence matrix corresponds to the
used variables cy[j1][j2][j3][j4-1] and the input variables
w[j2][j3], columns 2 and 3 correspond to the used
variables sum[j1][j3+j4-1], and the last three columns
correspond to the input variables w[j2][j3] and
x[j1+j2-1][j4].

The  distance  vectors  for  the  generated  data  items 
are  used  in  the  synthesis  phase  to  preserve  the  semantics 
of  the  program;  the  buffering  vectors  for  the  input  data 
items  are  included  in  the  dependence  matrix  in  an 
attempt  to  schedule  different  execution  times  for  the 
computations  requiring  the  same  variable.  The  next 
section  describes  the  methodology  used  to  generate  an 
algorithm transformation for a variable size array.


V.  TRANSFORMATIONS 

The synthesis phase of RAB utilizes a well known
transformation methodology hereafter referred to as
Linear Algorithm Transformation (LAT). This
methodology, which is described in ([FoPa84], [FoMo85],
and references therein), generates a transformation
matrix, $T = \begin{bmatrix} \pi \\ S \end{bmatrix}$, which maps the index points of the
bit-level algorithm into the space-time domain. The
LAT methodology uses the dependence matrix, D, to
ensure that generated and input data are available for
usage by the scheduled PE at the scheduled time of
execution for a given computation. Due to this fact,
only distance vectors for data dependencies and
buffering vectors for input variables are extracted in the
dependence/broadcast analysis module. The first
component of T, π ∈ Z^{1×n}, corresponds to the time
transformation; the second component, S,
corresponds to the space transformation. These
components are described in the following subsections.

Time  Transformations 

The linear time transformation, π ∈ Z^{1×n}, maps the
index set of the algorithm into the unidimensional time
space, π: J^n → t. Given the time transformation, π, the
time of execution of a computation indexed by j is given
by:

$$ t(\bar{j}) = \left\lceil \frac{\pi \bar{j} + O}{\mathrm{disp}\ \pi} \right\rceil \qquad (5.1) $$

where disp π = min{π d_i, d_i ∈ D} (d_i corresponds to the
ith column vector in D) and O = -min{π j : j ∈ J^n} + 1.
The constant O forces the first computations to be
executed at time t = 1. The parameter disp π represents
the maximum number of parallel arithmetic
computations executed in each processing element. We
restrict the value of disp π to one. This restriction is
representative of the systolic array used with RAB
(GAPP) and some other available architectures (i.e.,
MPP, DAP, CLIP). Given disp π = 1, the total
execution time of an algorithm is represented by the
expression

$$ \left( 1 + \sum_{i=1}^{n} |\pi_i| (N_i - L_i) \right) \delta \qquad (5.2) $$

where N_i and L_i correspond to the upper and lower
bounds of the loop variable j_i, respectively, and δ
represents the number of clock cycles needed for the
execution of the arithmetic computations. To ensure
that the ordering determined by π is an execution
ordering, we impose the restriction that π d_i > 0 for all
d_i ∈ D.
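
The validity restriction and the execution-time expression can be evaluated directly for a candidate time transformation. The sketch below is illustrative only: `is_valid` and `execution_steps` are hypothetical names, the execution time is returned in units of the clock-cycle factor (the bracketed term of the expression above), and the dependence columns are transcribed from the dependence matrix printed for figure 2a as read from this copy.

```python
def is_valid(pi, d_columns):
    """True when pi . d > 0 for every dependence or buffering vector d."""
    return all(sum(p * d for p, d in zip(pi, col)) > 0 for col in d_columns)


def execution_steps(pi, bounds):
    """1 + sum_i |pi_i| (N_i - L_i); bounds is a list of (L_i, N_i) pairs."""
    return 1 + sum(abs(p) * (hi - lo) for p, (lo, hi) in zip(pi, bounds))


# Columns of the dependence matrix for figure 2a (one tuple per column).
D_FIG_2A = [(0, 0, 1, 0), (0, 0, 1, -1), (0, 1, 0, 0),
            (0, 0, 0, 1), (1, 0, 0, 0), (1, -1, 0, 0)]

# Iteration space used in example 5.2: 1..3, 1..3, 1..3, 1..5.
BOUNDS = [(1, 3), (1, 3), (1, 3), (1, 5)]
```

With these inputs, the first rows of the two transformation matrices in example 5.2, (3, 1, 2, 1) and (5, 4, 2, 1), give 17 and 27 steps respectively, and (3, 1, 2, 1) satisfies the validity restriction against `D_FIG_2A`.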

The time transformation, π, is found by trying to
minimize the function (5.2), which is monotonic in terms
of the entries of π. Due to the monotonicity of the
function, we use a heuristic approach to generate π,
similar to the one presented in [OKFo86]. We start with
all entries of π being zero and progressively increase the
sum of the absolute value of the entries of each π. All
possible combinations of signs of each π are considered
with the exception of those obtained by negating
previously generated π's. We then check the validity of
each of the π's. The valid time transformations, i.e.,
those for which π d_i > 0 for all d_i ∈ D, are ordered
according to the execution time (5.2). Possible π's,
which might result from further increases in the absolute
value of the entries of a particular π for which
execution time is larger than the known minimum, need
not be considered due to the monotonicity property
mentioned above. The ordered list of π's is used to
generate the space transformations.
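
The outcome of this search, an ordered list of valid time transformations, can be reproduced on a small case by plain enumeration. The sketch below is a simplification and not the incremental heuristic described above: it tries every integer vector whose entries lie within an assumed bound `max_entry`, keeps the valid ones, and sorts them by execution time. The function name and the bound are assumptions; the dependence columns are transcribed from the matrix printed for figure 2a.

```python
from itertools import product


def search_time_transformations(d_columns, bounds, max_entry):
    """Return (steps, pi) pairs for all valid pi, ordered by execution time."""
    candidates = []
    for pi in product(range(-max_entry, max_entry + 1), repeat=len(bounds)):
        # Validity: pi . d > 0 for every column d of the dependence matrix.
        if all(sum(p * d for p, d in zip(pi, col)) > 0 for col in d_columns):
            steps = 1 + sum(abs(p) * (hi - lo) for p, (lo, hi) in zip(pi, bounds))
            candidates.append((steps, pi))
    return sorted(candidates)


D_FIG_2A = [(0, 0, 1, 0), (0, 0, 1, -1), (0, 1, 0, 0),
            (0, 0, 0, 1), (1, 0, 0, 0), (1, -1, 0, 0)]
BOUNDS = [(1, 3), (1, 3), (1, 3), (1, 5)]

ordered = search_time_transformations(D_FIG_2A, BOUNDS, max_entry=3)
```

For this input the head of the list is the vector (2, 1, 2, 1) with 15 steps. The vector (3, 1, 2, 1) of example 5.2 appears later in the list with 17 steps, which is consistent with the procedure falling back to slower time transformations when the space transformation or the conflict test rules out the faster ones.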

Space  Transformation 

The  space  transformation,  S,  determines  the 

spatial  mapping  of  the  algorithm  into  the  systolic  array. 
This  mapping  requires  knowledge  of  the  essential 
characteristics  of  the  given  architecture.  These 
characteristics  are  represented  in  the  architecture  model 
defined  in  the  following  definition. 

Definition 5.1

The systolic architecture is a four-tuple <L, P, R,
T> where L is the index set of the processor array, P is
the matrix of interconnection primitives, R is the set of
resources available in each PE, and T is the local
execution time of a computation.

Each point ℓ ∈ L corresponds to the relative
location of a processing element in the systolic array.
The matrix of interconnection primitives is such that if
p ∈ P, then for any ℓ ∈ L, ℓ is connected to ℓ' = ℓ + p if
ℓ' ∈ L, and ℓ is connected to an input-output port if
ℓ' ∉ L. The set consisting of the resources available in
each PE, R, is used in the microcode optimization phase
to detect machine dependencies. If two instructions
require resources beyond those given in R (i.e., the case
when two RAM ports are required and only one port is
given in R), a machine dependence exists between the
two statements. The local execution time, T, represents
the worst case time for the execution of an instruction.
This value is used to calculate δ in (5.2) to determine
the worst case execution time.

The two parameters of the architecture model, L
and P, define the global topology of the systolic
architecture. The other two parameters, R and T,
define the local architecture of each PE comprising the
systolic array. The parameters of the systolic
architecture for the GAPP array (shown in figure 3) are
given below.

Example  5.1 

The GAPP array shown in figure 3 is a systolic
architecture which can be represented by the four-
tuple <L, P, R, T> where the index set of one
chip is given by

$$ L = \{ (\ell_1, \ell_2) : 1 \le \ell_1 \le 12,\ 1 \le \ell_2 \le 6 \} $$

Figure 3. Representation of GAPP interconnection
scheme [DaTh84].

The matrix of interconnection primitives is


The set of shared resources available in each
processing element is given by

R = {cm reg., ns reg., ew reg., c reg., ram port, ALU}

and the worst case execution time of a
computation, T, is assumed to be three clock cycles
(one clock cycle to place data in proper registers
and execute a "full add" operation, and two clock
cycles to place data in the proper register for
shifting purposes).
end of example.

In mapping an algorithm into a systolic array, the
main goal is to ensure that the data communication
between processors can be accomplished using the given
interconnection primitives. In other words, if a
computation performed by processor P_i at time t_i
depends on data generated by processor P_j at time t_j,
then there must be a composition of interconnection
primitives that connects P_j to P_i in time t_i - t_j. The
composition of interconnection primitives is given by the
matrix K. To ensure that a direct path is
taken for the movement of data, we restrict all entries
in a column of K to have the same sign. Given these
parameters, the spatial transformation, S, must satisfy
the following set of diophantine equations

$$ SD = PK \qquad (5.3) $$

where S, D, P, and K are integer matrices of compatible
dimensions.

The sum of the absolute value of the entries of
column i of the K matrix represents the total number of
data movements for the corresponding data item
associated with column i of the dependence matrix.
This sum is upper bounded by π d_i, the upper bound on
the propagation time. We require the column sum to
equal π d_i since we include the buffering primitive
in our set of interconnection primitives. Only one
interconnection primitive for each unique data link is
included in the P matrix, i.e., even if a data link is
bi-directional, we include only one primitive corresponding
to one of the directions. Consequently, the matrix of
interconnection primitives used with RAB for the GAPP
architecture contains only the first three columns of the
P matrix given in example 5.1.

If no solution exists to (5.3), we select another π
from the ordered list with minimal increase in execution
time. If solutions exist to (5.3), we order the
transformation matrices (composed of an S and the
corresponding π) according to the AT (area x time)
criteria. We then choose the first transformation matrix
in the ordered list for which a conflict does not occur. A
conflict occurs when two or more computations are
mapped into the same PE to be executed at the same
time, given that only 1 ALU is available in each PE. In
other words, given two computations indexed by j' and
j'', a conflict occurs when T j' - T j'' = 0 or

$$ T (\bar{j}' - \bar{j}'') = 0 \qquad (5.4) $$

where j' - j'' represents the conflict vector. The conflict
vectors are generated using an analysis scheme similar
to the one used with the generation of buffering vectors
for input variables. If the conflict vector exists within
the given iteration space, we disregard the corresponding
T and check the next transformation matrix in the
ordered list. We continue this procedure until a
conflict-free algorithm transformation can be found for
the partial mapping of the bit-level algorithm into the
variable size array. Further details about the conflict
vector can be found in [Tayl86]. The conflict-free
transformation matrices for the two convolution expansions
are given below.

Example  5.2 

For the algorithm given in figure 2a, the conflict-
free transformation matrix for the iteration space
defined as {1 ≤ j1 ≤ 3, 1 ≤ j2 ≤ 3, 1 ≤ j3 ≤ 3, 1 ≤ j4 ≤ 5} is

         3  1  2  1
    T = -1 -1  0  0
        -2  0  1  0

with execution time of 17δ and spatial requirements
of 1 GAPP chip.

The conflict-free transformation matrix for the
algorithm in figure 2b with the same iteration space is

        5  4  2  1
    T = 1  0  0  0
        0  0  1  1

with execution time of 27δ and spatial requirements
of 1 GAPP chip. Both transformations optimize the
measure A x T, where A corresponds to the number
of GAPP chips.
end of example.
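
The conflict-free property of the two matrices in example 5.2 can be confirmed by enumeration, since a conflict is simply two distinct index points with the same image under T. The sketch below is a brute-force check, not the convex-set conflict analysis used by RAB; the names `apply_transformation` and `find_conflict` are assumptions.

```python
from itertools import product


def apply_transformation(T, j):
    """Image of index point j under the transformation matrix T."""
    return tuple(sum(t * x for t, x in zip(row, j)) for row in T)


def find_conflict(T, bounds):
    """Return a pair of index points mapped to the same (time, PE), or None."""
    seen = {}
    for j in product(*[range(lo, hi + 1) for lo, hi in bounds]):
        image = apply_transformation(T, j)
        if image in seen:
            return seen[image], j
        seen[image] = j
    return None


T_FIG_2A = [(3, 1, 2, 1), (-1, -1, 0, 0), (-2, 0, 1, 0)]
T_FIG_2B = [(5, 4, 2, 1), (1, 0, 0, 0), (0, 0, 1, 1)]
BOUNDS = [(1, 3), (1, 3), (1, 3), (1, 5)]
```

Both `find_conflict(T_FIG_2A, BOUNDS)` and `find_conflict(T_FIG_2B, BOUNDS)` return `None` for the iteration space of the example. Solving T d = 0 by hand for the figure 2a matrix gives conflict vectors that are multiples of (1, -1, 2, -6), which cannot fit while j4 spans only five values; widening the j4 range to 1..7 makes the same matrix produce a conflict.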

A full design of an algorithmically-defined array can
be specified by generating a transformation matrix using
the interconnection primitives for a planar array and
modifying the local systolic architecture parameters, R
and T, to model a general processing element. The
transformation matrix for a full mapping can be
generated using the same techniques described for a
partial mapping, with the exception of the selection of
an S which satisfies the given spatial constraints of the
fixed size array. For the case where an S cannot be
found which satisfies these constraints, algorithm
partitioning is required. The next section discusses the
last phase of RAB: mapping.


VI.  MAPPING 

The mapping phase of RAB consists of the last 2
modules of the flow diagram shown in figure 1,
microcode expansion and microcode optimization.
Microcode expansion consists of the replacement of the
given transformed computations with GAPP code (or the
code unique to the architecture used for execution).
This code is then optimized using a modified version of a
technique developed by Ramamoorthy ([Rama66],
[RaGo69]) known as Precedence Partitioning. The
straight-line microcode is parsed in a sequential manner,
placing used and generated variables in a symbol table.
If a used variable is encountered, the optimization
function checks the symbol table to see if this variable
has been generated in a previous statement, resulting in
a data dependence. The same applies for the other two
algorithm dependencies (output dependencies and
anti-dependencies). For the statements which are
algorithm independent, we pairwise check the resources
required for the parallel execution of two statements. If
the required resources exceed the resources available in
R, then a machine dependence exists between the two
statements. The algorithm and machine dependencies
are represented in a ((v-1) x v) connectivity matrix,
where v is the number of statements in the straight-line
code. The element C_ij has value 1 if statement j is
dependent on statement i and it has value 0 otherwise.
The precedence partitioning algorithm uses this matrix
to partition the set of computations into independent
groups by locating columns containing only zeros and
deleting the rows corresponding to the partitioned
statements. The partitions are executed serially but the
statements within the partitions are executed in parallel.
An example of the precedence partition for straight line
code is given below. An example using GAPP
instructions would require detailed knowledge about the
GAPP architecture, which is beyond the scope of this
paper.

Example  6.1 

For  the  following  straight  line  code 

(1) A = B + C
(2) D = A + E
(3) F = D + E
(4) G = H + I

the connectivity matrix is given by

        0  1  0  0
    C = 0  0  1  0
        0  0  0  0

The following partitions result from this matrix.

{ 1, 4 }, { 2 }, { 3 }.

end  of  example. 
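
The partitioning step of example 6.1 can be sketched in a few lines. This is a minimal reading of the description above rather than the modified version used in RAB: a statement is ready when its column contains only zeros among the rows of statements not yet partitioned, and partitioning a statement removes its row. The function name `precedence_partition` and the cycle check are assumptions.

```python
def precedence_partition(depends):
    """Partition statements into groups that can execute in parallel.

    depends is the (v-1) x v connectivity matrix: depends[i][j] == 1 when
    statement j+1 depends on statement i+1. The last statement has no row.
    """
    v = len(depends[0])
    remaining = set(range(v))
    partitions = []
    while remaining:
        ready = sorted(
            j for j in remaining
            if all(depends[i][j] == 0 for i in remaining if i < len(depends))
        )
        if not ready:
            raise ValueError("cyclic dependence between statements")
        partitions.append([j + 1 for j in ready])  # report 1-based numbers
        remaining -= set(ready)                    # deletes the matching rows
    return partitions


C = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]
```

Calling `precedence_partition(C)` returns `[[1, 4], [2], [3]]`, the partitions listed in example 6.1.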


VII. CONCLUDING REMARKS

In this paper we presented the overall organization
of RAB and discussed the concepts necessary for the
mapping of a class of numerical algorithms into bit-level
systolic arrays. We also presented a method for
identifying and possibly eliminating the occurrence of a
conflict. A conflict is more likely to occur with bit-level
algorithms, since bit-level expansions usually result in
the addition of 2 or 3 nestings of loops to the original
algorithm. Thus the iteration space of the bit-level
algorithm with dimension greater than 3 is mapped into
the space-time domain consisting of 3-dimensional space.
This mapping of n-dimensional space (where n > 3) into
3-dimensional space generally results in serializing some
of the loops of the iteration space. With the use of the
conflict analysis, we search for a transformation matrix
in which a conflict occurs outside of the given iteration
space, resulting in a bijective mapping.

RAB currently maps numerical algorithms into
variable size arrays composed of GAPP chips. However,
this tool can be used to investigate the optimality (in
terms of spatial requirements and total execution time)
of different expansions of the same task. The results of
these investigations can be used to efficiently design
algorithms for parallel execution. In this paper, two
different expansions for convolution were transformed
via RAB into algorithms for parallel execution. The
expansion given in figure 2a resulted in an execution
time of 17δ and the second expansion given in figure 2b
resulted in an execution time of 27δ. Thus, even though
both expansions performed the same task, one expansion
was more suitable for parallel execution as evident by
the total execution time (both expansions required only 1
GAPP chip).
Abstract — Improved multiprocessor performance can be attained by combining
data flow and control flow concepts. This type of combined architecture is
characterized and several examples of previously proposed machines are given.

A new model is presented that permits the analysis of such systems and
performance measures are defined. This model is then used to analyze the
performance of the algorithms under a wide variety of combined systems. The results
of these experiments show that partition size is a major factor in the
performance of such systems and an optimal size may be found for given system
parameters.

1.  Introduction 

This paper investigates the performance of architectures that combine concepts of both
data flow and control flow computers. In particular the relationships between the granularity of
program partitions and the architecture's performance in terms of execution space and time
requirements are examined. These relationships are determined by studying the performance of
two iterative algorithms. It is shown that, for these algorithms, partition size has a major effect
on the performance of combined architectures and that an optimal partition size may be found.

Recently, there has been considerable interest in combining some of the concepts
associated with data flow computers with some of those from the realm of more conventional control
flow multiprocessors. This interest seems well founded. While data flow concepts offer the
promise of much increased execution speed by removing artificial sequencing constraints, their
advancement is slowed by seemingly insurmountable problems [CaP82]. Meanwhile a large
portion of the thirty years' work devoted to the study of control flow methods does not appear
directly applicable to data flow computers.

This paper proceeds by quickly reviewing some previous work in combined systems,
describing both existing and proposed systems. Next, a model to facilitate the performance
evaluation of multiprocessor systems is described. With this background the paper describes
several experiments performed to illustrate relationships between granularity and performance.
Finally, several conclusions based on the results of these experiments are given.

This work was supported in part by the National Science Foundation under Grant DCI-8419745 and in
part by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization and
was administered through the Office of Naval Research under contract No. 00014-85-k-0588.


2.  Combined  Architectures 

We  define  data  flow  and  control  flow  as  schemes  to  determine  the  ordering  of  computa¬ 
tional  steps  in  parallel  programs.  A  pure  data  flow  scheme  sequences  operations  based  only  on 
the  availability  of  their  operands  and  adequate  computational  resources.  In  this  sense  pure 
data flow is a fully decentralised system. Conversely, a pure control flow scheme is based on a
schedule  independent  of  the  availability  of  an  operation’s  operands.  In  this  sense  pure  control 
flow  is  a  fully  centralised  system.  Of  course  a  “good”  control  flow  scheme  will  generate  a 
sequence  of  computational  steps  that  guarantees  data  availability.  A  combined  system  is  sim¬ 
ply  a  mix  of  these  two  models  of  computation. 

While  most  parallel  systems  are  not  “pure”  in  their  ordering  scheme,  we  will  be  studying 
only  those  systems  that  combine  wide  variations  in  their  ordering  scheme.  In  such  systems  there 
is,  in  general,  a  division  of  labor  between  various  nearly  pure  ordering  schemes,  with  the  division 
based  on  the  granularity  of  partitioning.  A  graphical  illustration  of  this  combination  is  the  ord¬ 
ering  scheme  graph,  shown  in  Figure  1.  The  ordinate  defines  the  level  of  granularity,  with 
smaller  values  representing  smaller  granules  of  space  and  time.  The  abscissa  represents  the 
degree to which ordering is decentralised. The range on this axis is arbitrarily set from zero to
one, with zero representing a pure control flow ordering scheme and one representing pure data
flow.  The  resulting  graph  is  a  set  of  coordinate  points  showing  the  level  of  centralisation  at 
each  level  of  sequencing. 

2.1.  Examples  of  Combined  Systems 

Recent research efforts have produced numerous proposals that combine the concepts of
data  flow  and  control  flow.  This  section  describes  several,  showing  an  ordering  scheme  graph  for 
each.  The  purpose  of  this  section  is  to  point  out  the  variety  of  current  proposals,  not  to  discuss 
their  relative  merits. 

The Piecewise Data Flow Architecture [ReM83] uses a two level approach, with distinction
occurring  at  the  granularity  level  of  the  basic  block.  A  basic  block,  a  term  commonly  used  in 
compiler  theory,  is  a  sequential  program  section  that  has  only  one  entry  point  and  one  or  more 
exit  points.  Internal  to  a  basic  block  a  data  flow  scheme  sequences  operations,  while  the  collec¬ 
tion  of  basic  blocks  that  make  up  a  program  are  executed  sequentially,  with  possible  overlap 
between  two  blocks.  In  addition  to  the  data  flow  scalar  processors,  an  SIMD  processor  is 
included to allow fast execution of vector operations. This makes this machine an example of a
truly  combined  architecture,  with  the  combination  being  segregated  into  data  flow  and  control 
flow  sections.  The  goal  of  the  architecture  is  to  allow  sequential  portions  of  scientific  programs 
to  enjoy  the  speedup  that  vector  portions  already  receive  on  systems  like  the  Cray-1. 

The  ordering  scheme  graph  for  the  Piecewise  Data  Flow  architecture  shows  all  low  granu¬ 
larity  operations,  up  to  the  level  of  the  basic  block,  have  a  high  level  of  decentralisation  as  they 
are  sequenced  using  data  flow  concepts.  Operations  with  larger  granularity  would  be  almost 
completely  centralised,  as  they  are  sequenced  serially.  A  possible  ordering  scheme  graph  for  this 
architecture is shown in Figure 2a. As a basic block can have a range of granularities, the
transition between data flow and control flow is shown as a region instead of a point and symbolised
by the dotted area on the graph. When considering this system, there are several important
features that our study of combined systems must consider. The first is obviously the combination
of data flow and control flow at different levels of granularity. Another important
feature  is  the  level  of  granularity  at  which  the  transition  is  made.  This  will  have  an  important 
effect  on  the  performance  of  such  a  combined  system.  Finally  this  architecture  limits  con¬ 
currency  by  allowing  only  one  block  (perhaps  overlapped  with  the  setup  of  one  other)  to  execute 
at  any  given  time. 

The Cedar project [GaL84] proposes another split level control scheme, with division at
the compound function level. The granularity of compound functions [GaK81] is slightly greater
than basic blocks, with operations like array primitives, linear recurrences, FORALL loops, pipeline
loops, block assignment statements, and compound conditional expressions. The architecture
consists of a global control unit and several processor clusters. The global control unit
sequences compound functions according to data flow principles. Processor clusters are assigned
compound functions to execute according to the principles of control flow. Thus, in this sense,
Cedar is the mirror image of Piecewise, as Cedar uses data flow to sequence large granularity
items and control flow to sequence low level operations, while Piecewise does the opposite.

A  scheme  graph  for  Cedar  shows  all  levels  of  granularity  below  that  of  the  compound 
function  with  a  low  level  of  decentralisation.  Operations  above  this  granularity  would  have 
higher  levels  of  decentralisation.  Figure  2b  shows  a  possible  ordering  scheme  graph  for  Cedar. 
Important  points  about  Cedar  are  similar  to  those  observed  for  Piecewise,  namely  the  change  in 
ordering  scheme  is  directly  related  to  the  granularity  of  operations,  and  the  level  of  granularity 
of  where  this  change  occurs.  Finally,  this  architecture’s  parallelism  between  compound  func¬ 
tions  is  limited  only  by  the  parallelism  available  between  them  and  the  availability  of  processor 
clusters.  Parallelism  within  a  compound  function  may  be  limited  by  the  control  flow  scheme 
used  by  the  processor  cluster,  although  the  compound  functions  have  regularly  structured  paral¬ 
lelism  that  may  be  easily  exploited. 

Remps [HwX85] has the same goal as both Cedar and Piecewise, i.e. scientific computation.
The global structure of the architecture is similar to Cedar: a collection of interconnected processors
and a global controller. The key difference is that Remps allows reconfiguration of interconnection
and control to emulate a variety of architectures. On a global level the machine is a
data flow computer and at the low level each processor is a reconfigurable control flow computer.
The level of change in ordering scheme is the granularity of a task. While the term task
is nebulous, it seems to describe a level of granularity slightly larger than a compound function.

Drawing  a  scheme  graph  for  Remps  requires  the  understanding  that,  as  in  Cedar,  all  low 
granularity  items  show  low  levels  of  distribution.  Large  granularity  objects  may  be  sequenced 
centrally  (called  macro-pipelining),  or  sequenced  in  a  distributed  fashion  (called  macro-data 
flowing). A possible scheme graph is shown in Figure 2c. The major interesting feature of this
architecture is the two schemes that exist simultaneously for large granularity items, although
only one is used in the sequencing of a single set of operations.

The Rediflow multiprocessor [KeL84] contains a complex combination of data flow, control
flow  and  reduction  concepts.  Reduction  is  another  decentralised  ordering  scheme  in  which  the 
demand  for  the  result  of  a  computation  causes  its  execution.  Rediflow  consists  of  interconnected 
Xputers  that  combine  processor,  memory,  and  packet  switch  elements.  The  Xputers  function 
under  a  reduction  ordering  scheme.  As  with  the  previously  described  architectures,  Rediflow 
exhibits  a  change  in  sequencing  at  some  level  of  granularity.  Here  the  granularity  is  medium  or 
function-level.  This  level  is  taken  to  be  about  the  same  as  basic  blocks.  Higher  granularity 
items  are  sequenced  by  either  reduction,  data  flow  or  control  flow.  Data  flow  provides  efficient 
pipelining,  while  reduction  may  be  more  adaptable  to  programs  requiring  unpredictable 
buffering. In addition, control flow sequencing is available by so called “von Neumann
processes”.

In  drawing  an  ordering  scheme  graph  for  Rediflow,  reduction  presents  a  new  issue  to  be 
represented. The basic ordering scheme graph is augmented with the abscissa extending from -1
to +1, with +1 representing pure data flow as before. The negative range represents demand
driven  schemes.  This  extension  shows  the  degree  of  decentralisation  by  the  absolute  value  of 
the  abscissa,  while  the  sign  determines  if  the  operations  are  executed  on  demand  (negative)  or 
availability  (positive).  This  is  a  proper  extension  of  the  ordering  scheme  graph  in  that  a  “pure” 
demand  driven  system  is  also  fully  decentralized  (represented  by  -1)  and  any  given  operation 
will  be  executed  under  either  demand  or  data  flow,  never  both.  Figure  2d  shows  a  possible 
scheme  graph  for  Rediflow.  All  operations  below  a  medium  granularity  are  given  a  high  level  of 
decentralization  in  the  negative  portion  of  the  graph  to  show  demand  driven  computation. 
Above this level, three possible schemes exist, resulting in the levels for data flow, reduction, and
von Neumann processes. When seen in this light, the sequencing characteristics of Rediflow
appear  somewhat  similar  to  Remps  as  large  granules  may  be  sequenced  in  one  of  several 
methods.  Obviously  low  level  sequencing  is  totally  different.  This  brief  survey  can  be  concluded 
by  reiterating  that  the  ordering  scheme  graphs  show  a  wide  diversity  in  the  approaches  used  in 
combining  data  flow  and  control  flow  concepts.  Many  other  combinations  are  possible,  conceiv¬ 
ably  as  many  as  possible  ordering  scheme  graphs. 

2.2.  A  Variable  Combined  Architecture 

This  study  does  not  investigate  previously  proposed  combined  systems,  but  concentrates  on 
one extremely flexible hypothetical architecture. This hypothetical system consists of interconnected
processing elements, each capable of communicating with and controlling the others. An equal
delay  and  infinite  capacity  communication  path  exists  between  each  pair  of  processing  elements. 
The  ordering  scheme  for  this  system  is  a  variable,  two  level  approach.  Larger  granules  are 
sequenced  according  to  either  data  flow  or  control  flow  principles,  while  smaller  granules  are 
sequenced  by  the  opposite  approach.  The  size  at  which  the  switch  occurs,  as  well  as  the  relative 
costs  of  performing  various  operations  are  left  as  variables  in  the  experiments. 

This  approach  has  several  distinct  advantages  over  analyzing  specific  systems,  and  a  few 
shortcomings.  The  greatest  advantage  is  the  availability  of  the  complete  range  of  systems 
between data flow and control flow, approaching these by either increasing or decreasing
partition size. In addition, this approach avoids the problems associated with comparing two
distinct  systems,  concentrating  instead  on  the  underlying  differences  between  ordering  schemes. 
Finally,  this  approach  limits  the  problem  by  ignoring,  at  this  point,  such  issues  as  network 
topology.  Of  course,  this  advantage  can  also  be  a  shortcoming  when  these  particular  issues  play 
a  dominant  role  in  the  system.  This  topic  is  currently  left  for  our  further  research. 

3.  The  COSMIC  Performance  Evaluation  Model 

To analyze the performance of combined data flow and control flow systems we have
developed  COSMIC,  the  Combined  Ordering  Scheme  Model  with  Isolated  Components. 
COSMIC  consists  of  both  formal  parameters  describing  a  multiprocessor  system  and  the  algo¬ 
rithm  it  executes,  and  analysis  techniques  producing  performance  measures.  The  underlying 
principles  of  this  model  are  the  isolation  of  individual  performance  issues  and  the  study  of  sys¬ 
tems  under  conditions  close  to  those  encountered  when  a  system  is  performing  useful  calcula¬ 
tions. 

Previous  work  in  modeling  multiprocessors  has  centered  in  several  distinct  areas.  Program 
behavior  models  endeavor  to  model  the  fundamental  properties  of  a  program  without  regard  for 
hardware  considerations  or  performance  measurement.  They  center  on  the  important  areas  of 
investigating  such  problems  as  the  determinacy,  boundedness,  and  termination  of  programs. 
Models fitting this category include Petri nets [Pet66] and Parallel Program Schemata [KaM68].
Petri nets have been augmented with the notion of time in either the deterministic [RaH80,
Ram74] or stochastic [Mol85] sense. The second major category of current models we call
machine  behavior  models  as  they  describe  the  behavior  of  machines  in  their  execution  of  pro¬ 
grams  as  opposed  to  the  behavior  of  programs  themselves.  Examples  of  this  class  include  Turing 
Machines, Functional Programming Systems, and the von Neumann Model [Bac78].
Classification models describe the configuration and operation of multiprocessors, including
Flynn’s Model [Fly66], Handler’s Classification System [Han77], and the “essential issues” of
Gajski and Peir [GaP85].

COSMIC  builds  on  these  previous  efforts,  but  is  fundamentally  different  from  them  in  that 
it  combines  both  program  and  machine  descriptions,  as  well  as  performance  measures.  The  use¬ 
fulness  of  this  model  is  in  this  combination,  allowing  the  study  of  complete  systems  under  varied 
conditions. This section briefly describes COSMIC and its operation. A more complete description
is available in [CaF87].

3.1. COSMIC Parameters

The parameters of a system S include O, the system’s organisation; $G_d$, a dependence
graph describing a specific algorithm; and OS, the ordering scheme used to execute algorithms
on the organisation. Included in a system’s organisation parameter are such features as the
number and capacity of processing elements, the amount and organisation of memory, and the
interconnection amongst processors and memory. The dependence graph is simply an operation
level precedence graph for a certain algorithm. This graph includes only algorithmic
constraints, not those induced by operation sequencing or programming languages. Finally, the
ordering  scheme  describes  how  algorithms  are  executed  on  the  organisation.  The  ordering 
scheme  is  further  segregated  into  descriptions  of  a  system’s  mechanisms  for  partitioning, 
sequencing,  resource  allocation,  and  memory  utilisation. 

3.1.1.  Organisation  (O) 

The  organisation  represents  the  arrangement  of  hardware  elements  in  a  system.  Every 
multiprocessor  has  three  basic  components:  processing  elements  (PE),  memory  locations,  and  the 
interconnections  between  them.  Input  and  output  devices  are  simply  treated  as  specialised  pro¬ 
cessing  or  memory  elements.  Consequently,  our  model  for  organisation  is  represented  by  the  tri¬ 
ple  O  =  (P,  M,  I). 

P  —  A  set  of  processing  elements.  Each  processing  element  has  a  set  of  instructions 
that  it  can  execute  and  a  relative  speed. 

M — A set of memory locations.

I — An interconnection function M × P → M × P. This function defines the possible
interconnections, and with each outcome there is a related cost function that
describes  the  cost  of  traversing  that  connection.  Local  memory  on  a  PE  can  be 
modeled  by  a  low  cost  (perhaps  zero)  of  traversing  the  connection.  Inaccessible 
memory  (some  other  PE’s  local  memory)  can  be  modeled  by  a  partial  function. 
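
As an illustration only, the triple O = (P, M, I) might be held in a small data structure such as the one below. This is a sketch under stated assumptions, not part of COSMIC itself: the class names, the PE and memory names, and the cost values are all hypothetical, and the interconnection function is simplified to a partial cost table from (PE, memory location) pairs, so that a missing entry stands for inaccessible memory.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class ProcessingElement:
    instructions: FrozenSet[str]  # instructions this PE can execute
    speed: float                  # relative speed


@dataclass
class Organisation:
    pes: Dict[str, ProcessingElement]  # P, keyed by a hypothetical PE name
    memory: Set[str]                   # M, hypothetical location names
    # Simplified I: cost of a PE reaching a memory location. A missing
    # entry means the location is inaccessible (the function is partial).
    cost: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def access_cost(self, pe: str, location: str) -> Optional[float]:
        """Return the traversal cost, or None if the location is unreachable."""
        return self.cost.get((pe, location))


# Hypothetical two-PE system: zero-cost local memory plus one shared bank.
org = Organisation(
    pes={"pe0": ProcessingElement(frozenset({"add", "mul"}), 1.0),
         "pe1": ProcessingElement(frozenset({"add"}), 2.0)},
    memory={"local0", "local1", "shared"},
    cost={("pe0", "local0"): 0.0, ("pe1", "local1"): 0.0,
          ("pe0", "shared"): 5.0, ("pe1", "shared"): 5.0},
)
```

Here `org.access_cost("pe0", "local1")` returns `None`, modeling another PE's local memory as lying outside the domain of the partial function.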

3.1.2. Data Dependency Graph ($G_d$)

The  data  dependence  graph  is  an  arc  and  vertex  weighted  directed  graph  in  which  vertices 
represent  operations  and  arcs  represent  data  dependencies  between  operations.  The  weight  of  a 
vertex represents the relative time that it will consume when executed. The weight of an arc
represents the size of the data needing transfer to satisfy the dependency. These weights can
also  be  viewed  as  the  number  of  “atomic  operations”  required  to  complete  the  computational  or 
transfer  operation.  This  graph  is  acyclic,  as  any  loops  in  a  program  are  unfolded  in  creating  the 
dependency graph. Currently data dependent behavior is not considered, but will be a topic for
future  research. 
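
A minimal sketch of such a graph follows, assuming a dictionary-based representation that is not taken from the paper. The class and method names are hypothetical; the acyclicity check uses Kahn's topological-sort algorithm, chosen here only for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DependenceGraph:
    vertex_weight: Dict[str, float] = field(default_factory=dict)           # operation -> relative time
    arc_weight: Dict[Tuple[str, str], float] = field(default_factory=dict)  # (src, dst) -> data size

    def add_op(self, name: str, time: float) -> None:
        self.vertex_weight[name] = time

    def add_dep(self, src: str, dst: str, size: float) -> None:
        if src not in self.vertex_weight or dst not in self.vertex_weight:
            raise KeyError("both operations must be added before the dependency")
        self.arc_weight[(src, dst)] = size

    def is_acyclic(self) -> bool:
        """Kahn's algorithm: the graph is acyclic iff every vertex can be removed."""
        indegree = {v: 0 for v in self.vertex_weight}
        for _, dst in self.arc_weight:
            indegree[dst] += 1
        ready = [v for v, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            v = ready.pop()
            removed += 1
            for src, dst in self.arc_weight:
                if src == v:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        return removed == len(self.vertex_weight)


# One multiply feeding one add, with hypothetical unit weights.
g = DependenceGraph()
g.add_op("mul", 1.0)
g.add_op("add", 1.0)
g.add_dep("mul", "add", 1.0)
```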

3.1.3.  Ordering  Scheme  Function  (OS) 

The  ordering  scheme  for  any  system  is  a  function  mapping  the  dependency  graph  into  an 
ordering  net,  based  on  the  organization  parameters.  An  ordering  net  is  a  timed  Petri  net 
[Ram74] which depicts ordering constraints placed on the execution of operations, as well as the
cost of each operation in the modeled system. The ordering scheme for an organisation O can
be defined as:

$$OS(O): G \to N,$$

where  G  is  the  set  of  all  possible  dependency  graphs  and  N  is  the  set  of  all  possible  ordering 
nets. This function is composed of several smaller, more easily defined functions. Thus the
ordering scheme function,

$$OS(O) = \gamma(O) \circ \mu \circ \lambda \circ \sigma(O) \circ \tau(O),$$

where the usual composition notation implies that $(f \circ g)(x)$ is equivalent to $f(g(x))$, contains
the component functions:

$\tau(O): G \to N$ Creates an ordering net from $G_d$,

$\sigma(O): N \to N$ Adds partitioning constraints,

$\lambda: N \to N$ Adds sequencing constraints,

$\mu: N \to N$ Adds memory access constraints and time,

$\gamma(O): N \to N$ Adds resource constraints.

The next sections briefly describe each function.

Computation Function (τ)

The  computation  function  creates  an  ordering  net  from  a  data  dependency  graph.  Its  sole 
purpose  is  to  change  domains  from  data  dependency  graphs  to  ordering  nets  and  is  scheme 
independent.  Scheme  independence  implies  the  function  itself  never  changes  over  all  possible 
ordering schemes. Formally the computation function is

$$N_\tau = \tau(O, G_d),$$

where $N_\tau$ is the computation ordering net for a given $G_d$ produced by $\tau$. The process used is to
transform vertices in $G_d$ to transitions in $N_\tau$, and arcs in $G_d$ to places in $N_\tau$. Connections in
$N_\tau$ are created to preserve the structure of the dependency graph. Finally, appropriate weights
are assigned to places and transitions, based on the speed of processing elements as defined by
the organisation parameter.
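
The transformation can be sketched as follows. This is an illustrative reading, not the tool used in this work: the function name, the dictionary layout of the net, and the place-naming scheme are hypothetical, and dividing each vertex weight by a single relative PE speed is an assumption about how weights are derived from the organisation.

```python
def computation_function(vertex_weight, arc_weight, speed=1.0):
    """Map a dependence graph to a net: vertices -> transitions, arcs -> places.

    vertex_weight: operation name -> relative execution time
    arc_weight:    (src, dst) -> size of the data transferred
    speed:         assumed relative speed shared by all processing elements
    """
    # Each operation becomes a transition whose weight is scaled by PE speed.
    transitions = {op: time / speed for op, time in vertex_weight.items()}
    # Each arc becomes a place connecting the same pair, preserving structure.
    places = {
        f"p_{src}_{dst}": {"weight": size, "pre": src, "post": dst}
        for (src, dst), size in arc_weight.items()
    }
    return transitions, places


transitions, places = computation_function(
    {"mul": 4.0, "add": 2.0}, {("mul", "add"): 1.0}, speed=2.0
)
```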

Partitioning Function (σ)

Partitioning  is  the  process  of  dividing  a  program  into  segments  to  allow  their  execution  on 
possibly  distinct  execution  units.  This  division  requires  the  addition  of  explicit  synchronization 
operations  between  segments  to  preserve  data  dependencies.  The  partitioning  function  creates  a 
new  ordering  net  based  on  these  added  synchronisation  requirements: 

$$N_\sigma = \sigma(O, N_\tau).$$

Unlike  the  computation  function,  the  partitioning  function  is  scheme  dependent  but  always  fol¬ 
lows  a  similar  form.  First  the  net  is  divided  into  segments  by  a  scheme  dependent  algorithm. 
Next,  any  transition  connected  via  a  single  place  to  a  transition  in  another  segment  is  synchron¬ 
ised  by  a  place-transition-place  sequence  between  the  two  transitions.  The  transition  models 
the  computation  required  to  complete  the  synchronisation,  while  the  places  model  data  transfer 
required  to  perform  this  function.  Finally,  weights  are  assigned  to  the  newly  created  places  and 
transitions, commensurate with the cost of synchronisation on the system. This function may
be applied recursively to segments to model multi-level ordering schemes.
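
One possible reading of the synchronisation step is sketched below. It assumes that the segment assignment is already known and that the place-transition-place sequence is inserted in place of the connecting place; neither detail is fixed by the description above. The function name, net layout, and cost parameters are hypothetical.

```python
def partition(transitions, places, segment_of, sync_cost=1.0, transfer_cost=0.0):
    """Insert a place-transition-place sequence on every cross-segment place.

    transitions: name -> weight
    places:      name -> {"weight", "pre", "post"}
    segment_of:  transition name -> segment identifier (assumed given)
    """
    new_transitions = dict(transitions)
    new_places = {}
    for name, p in places.items():
        if segment_of[p["pre"]] == segment_of[p["post"]]:
            new_places[name] = dict(p)  # same segment: no synchronisation needed
            continue
        sync = f"sync_{name}"
        new_transitions[sync] = sync_cost  # computation needed to synchronise
        # The original place now feeds the synchronisation transition ...
        new_places[name] = {"weight": p["weight"], "pre": p["pre"], "post": sync}
        # ... and a new place carries the data on to the consumer.
        new_places[f"{name}_out"] = {"weight": transfer_cost, "pre": sync, "post": p["post"]}
    return new_transitions, new_places


t, p = partition(
    {"mul": 1.0, "add": 1.0},
    {"p0": {"weight": 1.0, "pre": "mul", "post": "add"}},
    segment_of={"mul": 0, "add": 1},
)
```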

Sequencing Function (λ)

The  sequencing  function  is  responsible  for  adding  constraints  to  the  model  induced  by  the 
sequencing of segments and operations within those segments. This function causes the interpretation
of either control flow, data flow, or combined schemes. Formally, the sequencing function
produces a new ordering net from its input:

$$N_\lambda = \lambda(N_\sigma).$$

Again,  this  function  is  scheme  dependent,  and  must  be  specifically  defined  for  each  ordering 
scheme.  Several  levels  of  sequencing  strategy  may  be  modeled,  based  on  the  recursions  in  the 
partitioning  function.  At  the  lowest  level,  place-transition-place  sequences  are  inserted  between 
transitions  within  a  single  segment  to  model  their  sequencing.  In  a  data  flow  scheme  these  will 
be  in  parallel  with  each  place,  as  sequencing  occurs  on  each  data  transfer.  In  a  control  flow 
scheme,  they  are  placed  between  transitions  along  some  execution  trace. 

At higher levels, segments are sequenced by creating place-transition-place sequences
between segments. The details of this placement are dependent on the scheduling strategy being
modeled. Again, appropriate weights are assigned to all places and transitions added by this
function. 

Memory Access Function (μ)

Recall that in a data dependence graph an arc was weighted in accordance with the
amount  of  information  transfer  required  over  that  arc.  These  weights  were  transferred  to  places 
in  the  ordering  net.  The  memory  access  function  produces  a  new  ordering  net  reflecting  the 
added  costs  of  memory  access  and  interconnection  network  traversal: 

$$N_\mu = \mu(N_\lambda).$$

This function is scheme independent and simply replaces each non-zero weighted place with a
place-transition-place sequence. The new transition is given a weight equal to the weight of the
place  it  replaces. 
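
The replacement rule can be sketched directly. The net layout and names below are hypothetical, and giving the two new places a weight of zero is an assumption, since only the weight of the new transition is specified above.

```python
def memory_access(transitions, places):
    """Replace each non-zero weighted place by a place-transition-place sequence.

    transitions: name -> weight
    places:      name -> {"weight", "pre", "post"}
    """
    new_transitions = dict(transitions)
    new_places = {}
    for name, p in places.items():
        if p["weight"] == 0:
            new_places[name] = dict(p)  # zero-weight places are left untouched
            continue
        access = f"mem_{name}"
        # The new transition inherits the weight of the place it replaces.
        new_transitions[access] = p["weight"]
        # Assumed: the two surrounding places carry no additional cost.
        new_places[f"{name}_in"] = {"weight": 0, "pre": p["pre"], "post": access}
        new_places[f"{name}_out"] = {"weight": 0, "pre": access, "post": p["post"]}
    return new_transitions, new_places


mt, mp = memory_access(
    {"mul": 1.0, "add": 1.0},
    {"p0": {"weight": 3.0, "pre": "mul", "post": "add"}},
)
```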

Resource Allocation Function (γ)

Resource  allocation  is  the  process  of  assigning  sets  of  transitions  to  sets  of  resources  (pro¬ 
cessing  elements  and  memory  locations).  This  function  produces  a  new  ordering  net  limiting 
concurrency  within  these  sets: 

$$N_\gamma = \gamma(O, N_\mu).$$

The  resource  allocation  function  is  organisation  dependent  and  must  be  defined  for  each  system 
modeled.  In  general,  a  resource  allocation  function  will  assign  groups  of  transitions  to  resource 
pools modeled by the addition of places to the ordering net. This allows concurrency between a
set of transitions to be limited by the availability of a limited resource. It should be noted that
this  also  allows  the  modeling  of  resource  contention  for  memory  devices. 


3.2. Analysis and Measures

After a system has been described by the parameters of COSMIC, it is analyzed to determine
several performance measures. This analysis involves the determination of the time
between  the  firing  of  the  initial  and  final  transitions  in  an  ordering  net.  Computerized  analysis 
tools  aid  in  this  determination.  The  analysis  begins  by  creating  ordering  nets  using  a  high-level 
description  language  that  enables  the  specification  of  parameterized  nets.  Generally,  these 
parameters include the problem size and relative costs for computation, sequencing, and synchronization.
A compiler then fixes values for these parameters and produces a set of interconnected
places and transitions. Next, a net analyzer determines the various measures by firing
the net following the rules of timed Petri nets [Ram74]. Finally the results of many analyses are
gathered  into  a  database  for  further  off-line  studies.  The  entire  system  is  capable  of  analyzing 
nets up to about 50,000 places and transitions while consuming reasonable amounts of computational
resources. This enables the analysis of moderately large problems.

Three values are associated with each performance measure: the serial time, the critical
path  time,  and  the  number  of  resources  required  to  achieve  the  critical  path  time.  These  values 
describe  both  the  time  and  space  requirements  of  the  modeled  system  for  a  given  configuration. 
Two  classes  of  measures  are  used:  primary  measures  represent  consumption  of  resources  directly 
related to the algorithmic requirements of the system, and overhead measures show the consumption
of resources unrelated to any algorithmic requirements. The analysis consists of the
application of two analysis functions. The serial analysis function is:

$$N \to R,$$

where $R$ represents the set of real numbers and $N$ the set of ordering nets. It computes the
time required to fire all nodes in an ordering net, with the added constraint that no two transitions
may fire simultaneously. The critical path analysis function is:

$$N \to R \times \mathbb{N},$$

where $R$ represents the set of real numbers, $\mathbb{N}$ the set of non-negative integers, $N$ the set of
ordering nets, and $\times$ the cross product. It computes the time required to fire all nodes in an
ordering net, with only the constraints presented by the net, as well as the number of resources
required to achieve that level of performance. Finally the general analysis function,

$$A_N : N \to R \times R \times \mathbb{N},$$

simply combines the two previous functions, the result of which is a triple of values: (Serial
Time, Critical Path Time, Critical Path Space).
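
To make the three values concrete, the sketch below computes them for a small net reduced to weighted transitions and precedence pairs. It is not the net analyzer described above: place weights are assumed to be zero, the function name and example are hypothetical, and the space value is taken as the peak concurrency of an as-soon-as-possible schedule, which achieves the critical path time but is not guaranteed to be the minimum resource count.

```python
def analyse(weights, deps):
    """Return (serial time, critical path time, critical path space).

    weights: transition name -> firing time
    deps:    list of (before, after) precedence pairs; must be acyclic
    """
    preds = {t: [] for t in weights}
    for before, after in deps:
        preds[after].append(before)

    start = {}

    def start_time(t):
        # Earliest start: when the last predecessor has finished firing.
        if t not in start:
            start[t] = max((start_time(p) + weights[p] for p in preds[t]), default=0.0)
        return start[t]

    for t in weights:
        start_time(t)

    serial = sum(weights.values())  # no two transitions fire together
    critical = max((start[t] + weights[t] for t in weights), default=0.0)

    # Sweep over start/end events to find the peak number of busy resources.
    events = []
    for t, w in weights.items():
        if w > 0:
            events.append((start[t], 1))
            events.append((start[t] + w, -1))
    events.sort()  # at equal times, ends (-1) sort before starts (+1)
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return serial, critical, peak


# Two multiplications feeding a chain of two additions (hypothetical weights).
result = analyse(
    {"m1": 2, "m2": 2, "a1": 1, "a2": 1},
    [("m1", "a1"), ("m2", "a2"), ("a1", "a2")],
)
```

In this example the serial time is 6, the critical path time is 4, and two resources suffice, since both multiplications can fire together while the additions must be chained.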

If we let $M$ be such a triple, the total execution measure for a model with organization $O$,
ordering scheme $OS$, and data dependency graph $G_d$ is:

$$M = A_N(OS(O)(G_d)).$$

The execution measure is also, by definition, the sum of the five previously defined measures:

$$M = M_\tau + M_\sigma + M_\lambda + M_\mu + M_\gamma,$$

where $M_\tau$ is the computation measure, $M_\sigma$ the partitioning costs measure, $M_\lambda$ the sequencing
costs measure, $M_\mu$ the memory access measure, and $M_\gamma$ the resource allocation costs
measure. All the submeasures are also triples, the addition of which is defined in the usual
manner by adding corresponding entries. These measures represent the analysis of an ordering
net resulting from the application of a subset of the ordering scheme function. $M_\tau$ is the primary
measure, while the others are overhead measures. Overhead measures may contain negative
entries for Critical Path Space: when the critical path time grows, the space required to
achieve that performance may decrease.
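
Entry-wise addition of measure triples can be sketched as below. The type and function names are hypothetical and the numbers are invented purely to show the arithmetic, including a negative Critical Path Space entry in an overhead measure; they are not results from this study.

```python
from typing import NamedTuple


class Measure(NamedTuple):
    serial_time: float
    critical_path_time: float
    critical_path_space: int


def add_measures(a: Measure, b: Measure) -> Measure:
    """Add two measures by adding corresponding entries."""
    return Measure(
        a.serial_time + b.serial_time,
        a.critical_path_time + b.critical_path_time,
        a.critical_path_space + b.critical_path_space,
    )


# Invented values: one primary measure and two overhead measures.
computation = Measure(10.0, 4.0, 3)
partitioning = Measure(2.0, 1.0, 0)
sequencing = Measure(3.0, 2.0, -1)  # overhead lengthens the path, needs less space

total = add_measures(add_measures(computation, partitioning), sequencing)
```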


4.  Experiments  and  Results 


This  section  presents  results  gauging  the  effect  of  partition  size  on  combined  system  perfor¬ 
mance.  The  performance  of  two  simple  algorithms  is  studied  in  the  environment  of  a  hypotheti¬ 
cal  architecture  capable  of  executing  instructions  under  the  control  of  a  variety  of  ordering 
schemes.  The  organisation,  data  dependency  graphs,  and  the  various  functions  of  the  ordering 
scheme  that  manipulate  them  are  described.  Numerical  results  from  experiments  are  presented 
graphically  and  as  polynomial  equations.  As  a  compromise  between  the  infinite  variability  of 
our  hypothetical  architecture  and  the  availability  of  computational  resources  to  analyse  our 
systems, several restrictions are imposed on the experiments. Specifically, the scope of analysis
is  limited  by  assuming  that  resource  allocation  and  memory  access  constraints  are  ignored.  This 
will  lead  to  resource  allocation  and  memory  access  functions  being  set  equal  to  the  identity 
function.  The  resource  allocation  limitations  can  be  justified  by  assuming  equally  fair  and 
efficient  implementations  on  all  systems.  The  memory  access  limitation  can  be  justified  simi¬ 
larly,  although  different  numbers  of  memory  accesses  may  be  required  by  the  various  ordering 
schemes. However, these factors may affect system performance and ongoing research is aimed
at eliminating these restrictions.

4.1. The Organisation

First,  the  organisation  parameters  for  our  hypothetical  architecture  are  described.  As 
memory  access  or  resource  allocation  are  not  considered,  the  organizational  parameters  of 
consequence  are  the  number  and  speed  of  the  processing  elements.  Both  are  treated  as  variables 
in  these  experiments.  It  is  also  assumed  that  all  processing  elements  are  interconnected,  with  a 
communication  cost  of  zero  from  any  source  to  any  destination.  Future  research  is  planned  to 
investigate  the  effects  of  interconnection  topology  and  cost  on  combined  system  performance. 


4.2. The Dependency Graphs

The first algorithm studied is matrix-vector multiplication, using the algorithm shown
in Figure 3a, in which the matrix has size (SIZE × SIZE). In forming the data dependency graph
for this algorithm, note that the central operations in the algorithm are the multiplication of two
numbers and the addition of the result to a running sum. This central operation occurs
$SIZE^2$ times in the dependency graph. Therefore a base structure consisting of two vertices
connected by a directed arc is created. The vertex at the tail of the arc represents the multiplication
operation, while the vertex at the head represents the addition. Two arcs enter the multiplication
vertex, representing the matrix/vector input values, and one additional arc enters the
addition vertex to represent the previous value of the running sum. The addition vertex has a
single output arc. Therefore, creating a dependency graph for the algorithm involves replicating
this structure $SIZE^2$ times and interconnecting the copies appropriately. Added to this graph are SIZE
vertices representing the input vector and $SIZE^2$ vertices representing the input array. Figure
3b shows such a graph for the case when SIZE = 3. In this figure the computational vertices
are represented by circles and the input matrix/vector vertices by squares. Note that the input
vertices are connected to the multiplication operations and the addition operations are chained
to form the complete dot product operation.
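
The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vertex names ("mul", "add", "a", "x") and the two helper functions are hypothetical, and the initial running-sum input arc of each chain is omitted. The sketch replicates the two-vertex base structure once per matrix element, chains the additions of each row, and measures the longest chain of computational vertices.

```python
from functools import lru_cache


def build_matvec_graph(size):
    """Build the dependency graph for a size x size matrix-vector product.

    Returns (edges, computational_vertices, input_vertices), where edges maps
    each vertex to the list of its successors.
    """
    edges = {}
    comp, inputs = [], []

    def arc(tail, head):
        edges.setdefault(tail, []).append(head)

    for j in range(size):
        inputs.append(("x", j))              # SIZE input-vector vertices
    for i in range(size):
        for j in range(size):
            inputs.append(("a", i, j))       # SIZE^2 input-array vertices
            mul, add = ("mul", i, j), ("add", i, j)
            comp += [mul, add]               # the two-vertex base structure
            arc(("a", i, j), mul)            # matrix input to multiplication
            arc(("x", j), mul)               # vector input to multiplication
            arc(mul, add)                    # product feeds the running sum
            if j > 0:
                arc(("add", i, j - 1), add)  # previous value of the running sum
    return edges, comp, inputs


def critical_path_length(edges, comp):
    """Largest number of computational vertices on any directed path."""
    comp_set = set(comp)

    @lru_cache(maxsize=None)
    def longest_from(vertex):
        best = max((longest_from(w) for w in edges.get(vertex, [])), default=0)
        return best + (1 if vertex in comp_set else 0)

    return max(longest_from(v) for v in comp)
```

For SIZE = 3 the sketch yields 18 computational vertices, 12 input vertices, and a longest chain of SIZE + 1 = 4 computational vertices (one multiplication followed by three chained additions).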

The second algorithm studied computes a 4-point iterative relaxation function, using the
algorithm shown in Figure 4a, in which the matrix has size (SIZE × SIZE) and ITER iterations
are computed. When all loops are unfolded into their basic components, a central computational
block repeats many times throughout the algorithm. Here, the computational block
consists of three additions and a division, resulting in a 4-vertex graph with 4 inputs and
one output. This basic graph is repeated $SIZE^2 \times ITER$ times and appropriate interconnections
are made. As the dependency graph for the complete algorithm is complex, Figure 4b shows
only the central computational block. In this algorithm out-of-range indices in array subscripts
“wrap around” using the modulus function, and for simplicity initial input arcs are ignored.
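
Since Figure 4a is not reproduced here, the following Python sketch is only one plausible reading of the relaxation algorithm. It assumes that each point is replaced by the average of its four neighbours and that every iteration reads from the previous iteration's grid; the function name `relax` is hypothetical. Each inner step performs exactly three additions and one division, and out-of-range indices wrap around through the modulus operator.

```python
def relax(grid, iterations):
    """Run a 4-point relaxation with wrap-around indexing.

    Returns the final grid and the number of computational blocks executed.
    """
    size = len(grid)
    blocks = 0
    for _ in range(iterations):
        new = [[0.0] * size for _ in range(size)]
        for i in range(size):
            for j in range(size):
                # Three additions over the four neighbours ...
                total = grid[(i - 1) % size][j] + grid[(i + 1) % size][j]
                total = total + grid[i][(j - 1) % size]
                total = total + grid[i][(j + 1) % size]
                # ... followed by one division.
                new[i][j] = total / 4
                blocks += 1
        grid = new
    return grid, blocks
```

The block counter returned by the sketch equals SIZE^2 × ITER, matching the number of times the basic graph is replicated.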

4.3.  The  Ordering  Schemes 

Our experiments investigate two classes of ordering schemes. Both two-level combined
approaches require an ordering net partitioned into segments, with the number of segments
being a variable for experimentation. The first ordering scheme, denoted Cpart, sequences partitions
using a control flow ordering scheme, while individual operations within a partition are
sequenced using data flow concepts. The other ordering scheme, denoted Dpart, sequences partitions
using a data flow ordering scheme, while individual operations within a partition are
sequenced using control flow concepts. This section discusses the specifics of the partitioning
and sequencing functions for each case. The computation function assigns firing times to the
transitions it creates based on a parameter of the experiments called the computation time.
Again note that the resource allocation and memory access functions are the identity
function.

The same partitioning function is used for both the Cpart and Dpart ordering schemes.
As both algorithms have a grid structure, the ordering net is partitioned first by columns in that
grid of operations, and then, if required, by rows. For example, if 3 partitions were to be created
from a matrix example with SIZE = 3, each column in the grid of operations (see Figure 3a)
would be placed in its own partition. If six partitions were required, then each of the original
partitions would be divided in two. This strategy keeps operations that communicate most
often in the same partition whenever possible. Synchronization operations are then placed
between each pair of connected computational vertices that reside in different partitions. The
firing time of the additional transitions is the variable called synchronization time.
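
A minimal sketch of this column-then-row partitioning follows. It assumes that the number of partitions either divides SIZE or is a multiple of it; both function names are hypothetical, and the actual partitioning function used in the experiments may differ in detail.

```python
def assign_partition(row, col, size, n_partitions):
    """Map the operation at (row, col) of a size x size grid to a partition index."""
    if n_partitions <= size:
        # Partition by columns only: adjacent columns share a partition.
        return col * n_partitions // size
    # More partitions than columns: split every column into row pieces.
    pieces_per_column = n_partitions // size
    return col * pieces_per_column + row * pieces_per_column // size


def count_synchronizations(arcs, size, n_partitions):
    """Count arcs whose two endpoints fall in different partitions.

    Each arc is a pair ((row, col), (row, col)) of connected grid positions.
    """
    count = 0
    for (r1, c1), (r2, c2) in arcs:
        tail = assign_partition(r1, c1, size, n_partitions)
        head = assign_partition(r2, c2, size, n_partitions)
        if tail != head:
            count += 1
    return count
```

With SIZE = 3 and three partitions, each column receives its own partition; with six partitions every column is split in two, although the pieces are then unequal because 3 rows cannot be halved evenly (such a count is not a “uniform” partition size).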


The sequencing function for the Cpart ordering scheme places a data flow sequencing
operation in parallel with each unsynchronized place. This enforces a low-level data flow
scheme. When more partitions are created than columns in the grid structure of operations,
groups of “number of columns” partitions must be sequenced by “plies”. To this end, the function also
forces each ply of partitions to complete execution before the next is started. This is enforced
by adding a single transition between the plies. In effect this function implements the control
flow synchronization strategy called “Barrier Synchronization”. The firing times of the additional
transitions are set to the sequencing time variable.
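
The timing effect of sequencing plies with a barrier can be illustrated with a small sketch. It is a simplification rather than the ordering-net model itself: it assumes that the execution time of every partition is already known and that exactly one transition, costing one sequencing time, separates consecutive plies. The function name is hypothetical.

```python
def barrier_schedule(plies, sequencing_time):
    """Total execution time when plies are separated by barrier transitions.

    plies is a list of plies; each ply is a list of partition execution times.
    """
    total = 0
    for index, ply in enumerate(plies):
        if index > 0:
            total += sequencing_time  # the single transition between two plies
        total += max(ply)             # a ply ends when its slowest partition ends
    return total
```

For two plies with partition times [3, 5] and [4, 2] and a sequencing time of 1, the sketch gives 5 + 1 + 4 = 10 time units.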

The sequencing function for the Dpart ordering scheme places a control flow sequencing
operation between operations within partitions to ensure that no concurrency will take place
within a partition (i.e., a single trace of operations is executed serially). The previously added
synchronization operations already ensure data flow sequencing among partitions. The firing
time of these additional transitions is again called the sequencing time.

4.4.  The  Experiments 

Experiments were conducted to determine the system's sensitivity to changes in problem
size and the relative time required to execute computational, synchronization, and sequencing
operations. In each experiment, the measures that comprise the triple were determined.
Each experiment consisted of holding the problem size and the various time requirements
constant and varying the number of partitions over the range of “uniform” sizes (i.e., those in
which each partition had an equal number of operations to perform). The results of numerous
experiments with varying time requirements and problem sizes were combined to understand the
interdependence of all these factors on the performance of the systems.

After numerical results from the experiments were obtained, those related to the critical
path execution time were fit to polynomial curves based on the number of partitions. Except for
a few “off by one” errors at extreme partition sizes, all cases exhibit a piecewise linear relationship
between the number of partitions and the critical path performance of the algorithm.
Next, several equations from experiments corresponding to variations of the cost variables were
combined to obtain polynomial equations for each measure based on both the number of partitions
and the cost variables (e.g., sequencing time). Again all equations could be combined in a
piecewise linear fashion. At this point in the analysis several equations represent each measure,
one in terms of each cost variable. These equations were then unified into a single equation for
each measure in terms of all the cost variables and the number of partitions. These equations
can be verified by substituting appropriate constants for the cost variables to obtain the component
equations. Finally, the results of experiments on different problem sizes were combined
to obtain the final critical path equations for each measure.
The critical path measurement equations are shown in Tables 1 and 2 for the matrix multiplication
and iterative relaxation algorithms respectively. In these tables (and the remainder
of this paper), N represents the number of partitions and SIZE the problem size; the cost variables
are the computation time, the synchronization time, and the sequencing time. Also, note that the ceiling
function $\lceil z \rceil$ represents the smallest integer $\ge z$ and $\theta$ represents the unit step function:

$$\theta(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0. \end{cases}$$

Figures 5 through 8 are graphical representations of the equations for the execution time measures
(obtained by summing appropriate submeasures). Each figure contains three graphs, each
varying one of the cost variables. Figures 5 and 6 show matrix multiplication results while Figures
7 and 8 are from the iterative relaxation experiments. Figures 5 and 7 are for the Cpart
ordering scheme while Figures 6 and 8 represent the Dpart ordering scheme. In all cases the
problem size and default values of the time parameters are taken to be 8. Circles on the graphs
indicate function values at uniform partition sizes.

Examination of the measure equations yields a good understanding of the performance of
these two algorithms. While space limitations prevent complete analysis of these functions,
available in [CaF87], this paper endeavors to provide both the flavor and some interesting
results from the analysis. The matrix multiplication algorithm's computation measure is
$(SIZE + 1)\,T_c$ (with $T_c$ the computation time), which is easily explained by examining Figure 3b. The length of a critical path
is one greater than the size of the problem, and each computation requires $T_c$ to complete.

This algorithm's partitioning measure contains three components. The first two indicate
that two synchronization operations will enter the critical path when N < SIZE. This number
increases with N after N exceeds the problem size. The two initial operations result from the synchronizations
required to start and end each segment. The increasing factor that exists when
there are more partitions than columns of computations (SIZE) results from added synchronizations
needed between serial partitions. Note also the special case of one segment requiring only
one synchronization operation. This increase produces the staircase nature of Figure 7 and
results when a single partition is added to a “uniform” number, causing the critical path length to
increase. The final factor results from a synchronization operation in parallel with a computation
operation becoming more dominant within a range of partition sizes.

The Cpart ordering scheme's sequencing measure, obviously the most complex, consists of
four parts. The first indicates that a sequencing operation and an additional computational vertex
per “layer” will enter the critical path, while a synchronization operation leaves. The next
component indicates that SIZE - 1 additional sequencing operations are in the critical path, one
between each stage of the computation. The final factor is similar to the final factor of the partitioning
measure, adjusting which operations are in the critical path as their costs vary.

The Dpart ordering scheme's sequencing measure also consists of four parts. The first factor
indicates that as the number of partitions increases, added computational
operations will drop out of the critical path, to the point where no additional operations are
placed in the path by the sequencing function when $N = SIZE^2$. The second factor shows the
same trend and that there are two sequencing operations associated with each computation,
with sequencing operations dropping out of the critical path as the number of partitions
increases. The final factors are similar to those in the Cpart ordering scheme.

Now consider the iterative relaxation experiments. In these experiments, three iterations
of the algorithm were run (i.e., ITER = 3), which indicates that the critical path (using a wavefront
strategy) will be four times the size of the problem, minus 1. Since the critical path
through a single operation is 3 operations long, the resultant computation measure is just that
given. The partitioning measure indicates that when N < SIZE three synchronizations are
required for each partition: between each stage of the wavefront. Again the “ply” oriented synchronizations
exist above this level. Again, as in the matrix multiply algorithm, there is a special
case when N = 1 with 2 fewer synchronization operations required.

The Cpart sequencing measure consists of three components, one for each cost variable.
The first component indicates that as sequencing constraints are added to the model more of the
computational operations fall along the critical path. When there are fewer partitions than the
problem size this is a constant factor, and above this number a linear increase is seen. The
second component shows the sequencing operations that fall along the critical path, which has
a form similar to that of the added computational operations. Finally we again see that several synchronization
operations are removed from the critical path.

The Dpart sequencing measure is similar in form to the Cpart measure, except that the
weight of the computational and sequencing terms decreases linearly above SIZE partitions
instead of increasing. These factors are also responsible for the discontinuities that exist at
SIZE partitions. The final two terms of this expression indicate that the removal of synchronization
operations is limited, as in the matrix multiplication sequencing measures.

The following observations result from the outcomes of our experiments, as depicted in
Tables 1 and 2 and Figures 5-8.

•  The relationship between granularity and execution time.

Figures 5 through 8 show that granularity has a noticeable effect on the execution
time performance of these algorithms in the combined environment. In Figure 5 we
see that, as N increases, the execution time increases. This is a logical outcome for
the Cpart scheme, as parallelism is restricted when the partition size drops below the
size containing a complete column of the calculation. Figure 6, however, shows
decreasing execution time with increasing N. Again, this is logical as the Dpart
scheme restricts parallelism when there are many calculations in a single partition.
Interestingly, we see that analogous general trends hold in the relaxation algorithm,
as illustrated by Figures 7 and 8. Tables 1 and 2 confirm these results.

•  The effects of changing the relative costs of computation, synchronization, and sequencing.

Tables 1 and 2 show that the relationships between execution time and the computation,
synchronization, and sequencing times are linear for a given problem size and number of partitions.


•  The dominant costs in the performance of these algorithms.

Figures 5 through 8 show that computation and sequencing time are the dominant
factors in the performance of these algorithms. The effect of increasing or decreasing
their cost by a constant term increases or decreases the execution time by a factor
at least three times the effect of changing the partition size by the same amount.
Tables 1 and 2 confirm these results, as we see larger factors associated with the computation
and sequencing time terms than with the synchronization time terms.

•  The optimal number of partitions.

In each experiment, the optimal number of partitions varies and is dependent on the
relative costs of computation, synchronization, and sequencing operations.

Matrix Multiplication, Cpart Ordering Scheme -- Figure 5 shows that the optimal number
in all cases is a single partition.

Matrix Multiplication, Dpart Ordering Scheme -- Figure 6b shows that as synchronization
time increases the optimal number of partitions changes from 64 ($SIZE^2$) to
one.

Iterative Relaxation, Cpart Ordering Scheme -- Figure 7c shows that as sequencing
time becomes dominant, the optimal number of partitions is 8 (SIZE), while Figure
7b shows that when the synchronization time becomes dominant the optimal number
is one.

Iterative Relaxation, Dpart Ordering Scheme -- Figure 8b illustrates that as synchronization
time becomes dominant, the optimal number of partitions moves from 64
($SIZE^2$) to 1.

•  The effect of changing problem size.

Examining Tables 1 and 2 we see that problem size plays two roles in the performance
of these algorithms. The first is the linearly increasing critical path execution
time with increasing problem size, which is the critical path performance of these
algorithms. The second role is the determination of the “uniform” number of partitions,
as evidenced by the $\frac{N}{SIZE}$ terms throughout these tables.

6.  Conclusions  and  Further  Work 

COSMIC has been used to study combined systems, and was illustrated by studying the
impact of partition size on a system's performance. This allowed the identification of the optimal
partition size in relation to given system parameters. While these results apply directly only to
two iterative algorithms (differing mainly in their interconnectivity), they provide hints as to
what factors affect the performance of combined systems. Future work will focus on efforts to
generalize these results to other algorithms and to include the effect of memory accessing and
resource allocation.
Table 2

Iterative  Relaxation  Critical  Path  Measures 
Systolic arrays have
regular  and  modular 
structures  that  match 
the  computational 
requirements  of  many 
algorithms.  Their 
implementation 
requires  that  a  wealth 
of  subsumed  concepts 
and  engineering 
solutions  be  mastered 
and  understood. 
Systolic arrays are the result of advances in semiconductor technology and of applications that
require extensive throughput. Their realization requires human ingenuity combined
with techniques and tools for algorithm development, architecture design, and
hardware implementation.

Invariably, the first reaction of people who are exposed to the systolic-array concept
is one of admiration for the concept's elegance and for its potential for high performance.
However, those who next attempt to implement a systolic array for
a specific application soon realize that a wealth of subsumed concepts and engineering
solutions must be mastered and understood. This special issue attempts to
provide insights into the implementation process and to illustrate the different techniques
and theories that contribute to the design of systolic arrays.


Characteristics  of 
systolic  arrays 


Since 1978, when H.T. Kung and C.E. Leiserson introduced the term “systolic
array” and the concept behind the term, much research has been done and much
has been written about the design of algorithms and architectures suitable for
such structures. Today, the idea of a systolic array is as familiar to many computer
scientists and engineers as that of a compiler or a microprocessor.

The term “array” originates in the systolic array's resemblance to a grid in which
each point corresponds to a processor and a line corresponds to a link between
processors. As regards this structure, systolic arrays are descendants of array-like
architectures such as iterative arrays, cellular automata, and processor arrays.
These architectures capitalize on regular and modular structures that match the
computational requirements of many algorithms. Table 1 is a list of applications
for which systolic designs are available. Systolic arrays belong to the generation of
VLSI/WSI (Very Large Scale Integration/Wafer Scale Integration) architectures
for which regularity and modularity are important to area-efficient layouts.

Although the array structure characterizes the interconnections in systolic arrays,
it is the term “systolic” that captures the innovative and distinctive behavior of
these systems. “Systolic” in this context means that pipelined computations take
place along all dimensions of the array and result in very high computational throughput.
In other words, systolic algorithms schedule computations in such a way that
a data item is not only used when it is input but also is reused as it moves through the
pipelines in the array. This results in balancing the processing and input/output
bandwidths, especially in compute-bound problems that have more computations to
be performed than they have inputs and outputs. Conventional processor designs
are often limited by the mismatch of input bandwidth and output bandwidth, which
occurs because data items are read/written every time they are referenced.

One reason for choosing “systolic” as part of the term “systolic array” was to
draw an analogy with the human circulatory system, in which the heart sends and
receives a large amount of blood as a result of the frequent and rhythmic pumping of
small amounts of that fluid through the arteries and veins. In this analogy, the
heart corresponds to a source and destination of data, such as a global memory, and
the network of veins is equivalent to the array of processors and links. Another
explanation of the term is that in many of the first proposed systolic architectures,
processing elements alternated between cycles of “admission” and “expulsion” of
data, much in the same way that the heart behaves with respect to the pumping of
blood.

In the article “Why Systolic Architectures?” H.T. Kung presents an excellent
introduction to the basic ideas, the advantages, and the open problems of systolic
arrays. Today, this article is still essential reading for those interested in learning the
fundamentals of systolic arrays. Our introduction endeavors neither to replace nor
to repeat the contents of that pioneering article. However, it is appropriate to
elaborate briefly on the three factors that characterize systolic arrays as they were
originally proposed, namely technology, parallel/pipelined processing, and applications.
These factors also identify the reasons for the success of the concept, namely
cost-effectiveness, high performance, and the abundance of applications for which
systolic arrays can be used.

Technology and cost-effectiveness.

Nowadays, mature VLSI/WSI technology permits the manufacture of circuits whose
layouts have minimum feature sizes of 1 to ? micron. The effective yields of
VLSI/WSI fabrication processes make possible the implementation of circuits
with up to half a million transistors at reasonable cost, even for relatively small
production quantities. However, the advantages of this technology are not fully
realized unless simple, regular, and modular layouts are used. Systolic arrays
attempt to meet these topological constraints by using simple processing elements
that, together with a simple interconnection pattern, are replicated
along one or more dimensions. Cost, regularity, and modularity are factors
leading to the design and optimization of individual processing elements and their
respective interconnections. Consideration of these three factors indicates that
processor arrays are cost-effective engineering solutions to the problem of building
systems with many processing elements.

The main difference between the design of systolic arrays and that of other integrated
systems of comparable complexity is illustrated in a general way in Figure 1.
The Y-chart shown in the figure is a convenient and succinct description of the
different phases of the process of designing VLSI systems. The axes of the Y-chart
correspond to orthogonal forms of system representation, and the arrows represent
design procedures that translate one representation into another. A top-down
design procedure (that is, one that progresses from more complex components
to simpler subcomponents) can also be indicated by arrows drawn along each
axis and pointed toward the origin. While many different design approaches, and
their corresponding Y-charts, are possible, design is typically carried out through
successive refinements. In this process, a component's functional specification is
translated first into a structural representation and then into a geometrical description
in terms of smaller subcomponents; the functional description of each of these
subcomponents must then be translated into structural and geometrical descriptions
in terms of even smaller parts, and so on. The line arrows shown in the figure are
intended to convey, in a general way, the flow of this process for systolic arrays
versus more conventional systems. Since a systolic array consists of a large number
of a few types of modules, the process of refining the overall system and designing
every subcomponent is faster and simpler than it is in systems with the same size but
a much larger number of module types.


Table 1. Applications for which systolic designs are available.

Signal and Image Processing and Pattern Recognition
    FIR, IIR filtering, and 1D convolution
    2D convolution and correlation
    Discrete Fourier Transform
    Interpolation
    1D and 2D median filtering
    Geometric warping
    Feature extraction
    Order statistics
    Minimum-distance classification
    Covariance matrix computation
    Template matching
    Seismic signal classification
    Cluster analysis
    Syntactic pattern recognition
    Radar signal processing
    Curve detection
    Dynamic scene analysis
    Image resampling
    Scene matching

Matrix Arithmetic
    Matrix-matrix multiplication
    Matrix triangularization
    QR decomposition
    Sparse-matrix operations
    Solution of triangular linear systems

Non-Numeric Applications
    Data structures: stacks and queues, sorting
    Graph algorithms: transitive closure, minimum spanning trees
    Connected components
    Language recognition
    Dynamic programming
    Arithmetic arrays
    Relational database operations
    Algebra

This is conveyed graphically in Figure 1 by means of large arrows showing that in the
design of a systolic array, one can proceed faster and more directly to the design of
lower-level components of the system than in traditional design.

Figure 1. A Y-chart that shows the process of designing algorithmically specified
VLSI digital systems.

Commercially available systolic-array chips with 10 to 100 simple, 1-bit processors
exist; these chips sell for less than one hundred dollars apiece. Other chips,
including microprocessors and digital-processing chips, both of which can be
used as building blocks in systolic arrays, are also available at even lower cost.
Systolic arrays with thousands of processors can be built by assembling many such
building blocks (chips) at total prices that range from ten thousand to a hundred
thousand dollars and depend on the complexity of each processor.


Parallel/pipelined processing. Systolic arrays derive their computational efficiency
from multiprocessing and pipelining. Multiprocessing is a natural
consequence of the activities going on simultaneously in various processing elements
of the array. Pipelining can be thought of as a form of multiprocessing
that optimizes resource utilization and takes advantage of dependencies among
computations. In systolic arrays, data pipelining reduces the input/output-bandwidth
requirements by allowing a data item to be reused once it enters the
array. Typically, inputs enter the array through peripheral processing elements
and are propagated to neighboring processing elements for further processing.
These movements of data through the array take place both along a fixed direction
in which a link exists between neighboring processing elements and in a
periodic manner.

In addition to data pipelining, systolic arrays are also characterized by computational
pipelining, in which information flows from one processing element to
another in a prespecified order. This information can be interpreted by the receiver
as data, control, or a combination of both. Each output is computed by the
execution, at different times and in a predetermined sequence, of several operations
in a number of processing elements; the execution is performed in such a way
that the output generated by one processing element is used as an input by a neighboring
processing element. While operations can occur as data flows through
each processor, the overall computation is not a dataflow computation, since the
operations are executed according to a schedule determined by the systolic array
design. After a processing element generates an intermediate output and sends this
output to the element's neighboring processing elements, the element computes
another intermediate output. As a result, systolic arrays achieve high
throughputs.
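
To make data pipelining concrete, the following Python sketch simulates a hypothetical linear array that computes a matrix-vector product. It is an illustrative assumption rather than a design taken from this article: processing element i keeps the accumulator for output i, each vector element enters at the leftmost element and moves one element to the right per cycle, and the matrix entries are assumed to arrive at each processing element on a suitably skewed schedule.

```python
def systolic_matvec(a, x):
    """Simulate a linear systolic array computing y = a * x.

    Returns the result vector and the number of clock cycles used.
    """
    rows, cols = len(a), len(x)
    acc = [0] * rows      # accumulator held by each processing element
    pipe = [None] * rows  # (index, value) of the x item currently in each element
    cycles = rows + cols - 1
    for t in range(cycles):
        # Every item moves one processing element to the right.
        for i in range(rows - 1, 0, -1):
            pipe[i] = pipe[i - 1]
        # A new item enters through the peripheral (leftmost) element.
        pipe[0] = (t, x[t]) if t < cols else None
        # All elements holding an item perform one multiply-accumulate.
        for i in range(rows):
            if pipe[i] is not None:
                j, value = pipe[i]
                acc[i] += a[i][j] * value
    return acc, cycles
```

Each vector element is read from outside the array only once and is then reused by every processing element it passes through, which is the reduction in input/output bandwidth described above. For a = [[1, 2], [3, 4]] and x = [5, 6], the sketch returns [17, 39] after three cycles.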

Applications and algorithms. Algorithms
suitable for implementation in systolic
arrays can be found in many applications,
such as digital signal and image
processing, linear algebra, pattern recognition,
linear and dynamic programming,
and graph problems. In fact, most of the
algorithms in the listed applications are
computationally intensive and require systolic
architectures for their implementations
when used in real-time environments.
The acceptance of this fact is evidenced by
the existence of prototype and production
systolic arrays for modern real-time digital
signal processing systems. The
manufacturers of these arrays include,
among others, companies such as ESL-TRW,
Hughes, NCR, GE, Hazeltine, and
Motorola. When systolic arrays were first
proposed, they were intended for applications
with two important sets of characteristics.
First, these applications require
high  throughput  and  large  processing 
bandwidth,  possibly  at  the  cost  of 
increased  response  time.  In  other  words, 
it is more important to keep up with the
flow of data than to generate a set of outputs
for a given set of inputs as quickly as
possible. Second, these applications can be
efficiently supported by algorithms that
can be implemented on arrays consisting
of a few types of simple processing elements;
the arrays have simple controls and
input/output ports in the peripheral
processing elements. These algorithms are
characterized by repeated computations of
a few types of relatively simple operations
that are common to many input data
items. Often the algorithms can be
described by programs with nested loops
or by recurrence equations that describe
computations performed on indexed data.
In addition, the pattern of generation and
usage of data by different operations displays
some regularity and uniformity,
which means that the resulting communication
requirements can be met by the
localized interconnections.
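
As an illustration of the kind of algorithm described above, the following sketch
contrasts a matrix-vector product written as a nested loop with a cycle-by-cycle
simulation of a linear array in which each processing element keeps one
accumulator and passes the x values to its right-hand neighbor. The function
names and the particular data flow are assumptions chosen for this example; they
are not taken from any specific design discussed in the article.

```python
def matvec_nested_loops(A, x):
    """Reference version: y[i] = sum over j of A[i][j] * x[j]."""
    n, m = len(A), len(x)
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y


def matvec_systolic(A, x):
    """Simulate a linear array of n processing elements (PEs).

    Hypothetical data flow: x values enter PE 0 one per cycle and move one
    PE to the right on every cycle; PE i accumulates y[i] locally.
    """
    n, m = len(A), len(x)
    acc = [0] * n        # one accumulator per PE
    pipe = [None] * n    # x value currently held by each PE (None = bubble)
    for t in range(m + n - 1):
        # Shift every x value one PE to the right and feed PE 0.
        pipe = [x[t] if t < m else None] + pipe[:-1]
        for i in range(n):
            if pipe[i] is not None:
                j = t - i            # PE i holds x[t - i] at cycle t
                acc[i] += A[i][j] * pipe[i]
    return acc


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    x = [1, 1]
    print(matvec_nested_loops(A, x), matvec_systolic(A, x))
```

Each PE only communicates with its immediate neighbor, which is the localized
interconnection property the paragraph refers to.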
Implementation issues

Given the technical and economic principles
that assure the soundness of the
systolic-array  concept,  one  needs  to  con¬ 
sider  the  issues  involved  in  implementing 
a  system  for  a  specific  application.  Some 
of  these  issues  are  briefly  discussed  here. 

General-purpose and special-purpose systolic systems. Typically, a systolic
array can be thought of as an algorithmically
specialized system in the sense that its
design reflects the requirements of a specific
algorithm. However, it may be desirable
to design systolic arrays that are
capable of efficiently executing more than
one algorithm for one application or more.
Two approaches are possible in designing
these “large-purpose” systems, and a
compromise between the two is often
found in many actual implementations.
One approach is based on adding hardware
mechanisms so as to reconfigure the
topology and interconnection pattern of
the systolic array and to emulate the
requirements of a specialized design. A
concrete example of this approach is the
Configurable Highly Parallel computer
(CHiP), which has a programmable lattice
of switches for reconfiguration purposes.
The other approach uses software
to map different algorithms into a fixed-array
architecture. As is the case with the
approach behind other general-purpose
parallel computers, this approach may
require the use of programming languages
capable of expressing parallel computations,
as well as the development of translators,
operating systems, and programming
aids. These requirements apply,
for example, in the case of Warp, a systolic
array developed at Carnegie Mellon
University. For each algorithm, the
designer needs to identify the efficient systolic
designs and mappings and the appropriate
techniques to use. The issue of
appropriate techniques is of great importance,
since the final performance, cost,
and correctness of the design are governed
by these techniques.

Design and mapping techniques. To synthesize a systolic array from the
description of an algorithm, a designer
needs a thorough understanding of and
familiarity with the principles behind four
things: systolic computing, the application,
the algorithm, and the technology.
Such skilled designers can provide excellent
heuristic designs for important
algorithms. However, the process is slow
and error prone and may require extensive
simulations, and the resulting designs are
not guaranteed to be optimal or correct.
Progress has been made in the development
of systematic design techniques to
automate this process. These techniques
are unlikely to replace the designers completely;
instead, they will provide tools and
formal concepts to assist designers in
searching for diverse and desirable designs
for a given application. Most of these techniques
are concerned with the derivation
of a relatively high-level specification of
the array architecture from a description
of the algorithm. Typically, such a specification
includes the size and topology of
the array, the operations performed by
each processing element, the order and
timing  of  data  communication,  and  inputs 
and  outputs.  To  a  limited  extent,  these 
techniques  can  take  into  account  techno¬ 
logical  factors  and  the  relationship  of  the 
systolic  array  itself  to  the  rest  of  the  sys¬ 
tem.  However,  they  are  not  complete:  they 
can  only  be  used  at  the  specification 
level — and  only  in  an  indirect  manner 
there.  Until  more  is  learned  about  design 
techniques  that  can  be  used  conveniently 
for  detailed  integration  of  system  and  tech¬ 
nology,  such  integration  problems  will 
continue  to  be  left  for  the  designer  to  solve. 

Granularity.  The  basic  operation  per¬ 
formed  in  each  cycle  by  each  processing 
element  in  the  various  systolic  arrays  can 
range  from  a  simple  bit-wise  operation,  to 
word-level  multiplication  and  addition, 
and even to execution of a complete program.
The choice of granularity is determined
by the application, or the
technology,  or  both.  For  example,  appli¬ 
cations  that  use  algorithms  with  basic  bit- 
level  operators  and  data  structures  natu¬ 
rally  suggest  that  processing  elements  be  of 
a  corresponding  complexity.  The  same 
choice  of  processing  elements  might,  how¬ 
ever,  result  from  considerations  such  as 
input/output-pin restrictions and the technology
that may be used. In programmable
systolic arrays, the granularity may
also be determined by trade-offs between
the desired degree and level of programmability.
The Saxpy Matrix-1 is an
example of a programmable systolic computer
with large granularity, whereas bit-level
systolic arrays, like those discussed by
J.V. McCanny and J.G. McWhirter, are
special-purpose designs with low
granularity.

Extensibility. Many specialized systolic
arrays can be regarded as hardware
implementations  of  a  given  algorithm. 
This  view  holds  when  there  is  a  direct  cor¬ 
respondence  between  the  operations  and 
variables  of  the  algorithm  and,  respec¬ 
tively,  the  processing  elements  and  wire 
links  of  the  systolic  array.  In  such  a  case, 
the  systolic  processor  can  execute  only  a 
given  algorithm  that  is  designed  for  a  prob¬ 
lem  of  a  specific  size.  If  one  wishes  to  exe¬ 
cute  the  same  algorithm  for  a  problem  of 
a  larger  size,  then  either  a  larger  array  must 
be  built  or  the  problem  must  be  parti¬ 
tioned.  The  first  approach  is  easy  to  con¬ 
ceptualize  and  simply  requires  that  more 
processing  elements  be  used  to  construct 
an  enlarged  version  of  the  original  array. 
However,  as  regards  implementation,  one 
must  remember  that  there  may  be  factors 
that  do  not  affect  performance  in  small 
arrays  but  might  affect  it  in  larger  systems. 
These factors include clock synchronization,
reliability, power requirements,
chip-size limitations, and input/output-pin
constraints. 

Clock  synchronization.  In  large  syn¬ 
chronous  systolic  arrays,  clock  lines  of 
different  lengths  can  introduce  clock 
skews  and  may  require  that  a  slower  clock 
be  used.  Possible  approaches  that  avoid 
this  problem  of  clock  skews  include 
designing  systolic  arrays  that  do  not  allow 
data  to  flow  in  opposite  directions  and 
using efficient layouts of the clock distribution
network. An alternative to the
design  of  a  globally  synchronous  array  is 
to  achieve  a  self-timed  system  through  the 
use  of  asynchronous  handshaking 
mechanisms  established  between  neigh¬ 
boring  processing  elements.  These  self- 
timed  implementations  are  commonly 
referred to as wavefront arrays.

Reliability.  Simple  laws  of  probability 
can  be  used  to  explain  why  increasingly 
large  arrays  are  decreasingly  reliable  unless 
redundancy  is  incorporated  and  fault- 
tolerance  mechanisms  are  available.  In 
fact, the reliability of an array of processors
is equal to that of a single processor raised
to the power of the number of processors in
the array. Since the reliability of a processor
is a value less than one, the reliability
of  the  global  array  quickly  approaches 
zero  as  the  number  of  processors  increases. 
Fault  tolerance  requires  that  faults  be 
detected  and  located  so  that  faulty  process¬ 
ing  elements  can  be  replaced  by  opera¬ 
tional  spares  through  an  appropriate 
reconfiguration scheme. A fault-tolerant
systolic  array  may  need  additional  hard¬ 
ware  to  meet  these  requirements.  In  addi¬ 
tion,  if  time  redundancy  is  used  or  system 
operation  needs  to  be  suspended  for  test¬ 
ing  purposes,  the  fault-tolerant  array  can 
be  slower  than  the  original  one.  A  good 
fault-tolerant  design  has  as  its  goal  max¬ 
imizing  reliability  while  minimizing  the 
corresponding  overhead.  In  systolic 
arrays,  possible  approaches  to  fault  toler¬ 
ance  include  simple  extensions  of  well- 
known  techniques  used  in  conventional 
digital  systems.  However,  these  techniques 
do  not  take  advantage  of  the  characteris¬ 
tics  of  either  systolic  arrays  or  the 
algorithms  they  execute.  Novel  and  suc¬ 
cessful,  though  general,  fault-tolerance 
schemes that take advantage of these
characteristics  have  been  proposed  for  sys¬ 
tolic  arrays. 
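
The reliability argument above can be made concrete with a short calculation.
The sketch below assumes, as the text does, that every processor has the same
reliability r, that failures are independent, and that the array has no
redundancy, so the array works only if all n processors work. The function name
is chosen for this example only.

```python
def array_reliability(r, n):
    """Reliability of an array of n processors without redundancy.

    Assumes identical, independently failing processors, each with
    reliability r (0 <= r <= 1); the array works only if all of them work.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return r ** n


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, array_reliability(0.999, n))
```

Even with a per-processor reliability of 0.999, an array of 1000 processors has
a reliability of only about 0.37 under these assumptions, which is why
redundancy and reconfiguration become necessary as arrays grow.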

Partitioning  of  large  problems.  When  it 
is  necessary  to  execute  a  large  problem 
without  building  a  large  systolic  array,  the 
problem  must  be  partitioned  so  that  the 
same  algorithm  can  be  used  to  solve  the 
smaller  problem  and  so  that  an  array  of 
small,  fixed  size  can  be  used.  The  main 
concerns are to avoid rendering the partitioned
algorithm incorrect and to avoid
increasing  the  complexity  of  the  design  sig¬ 
nificantly.  One  approach  identifies  algo¬ 
rithm  partitions  and  an  order  of  execution 
of  these  partitions  such  that  correctness  is 
preserved  and  the  original  array  can  be 
used to execute each partition. The perceived
result of this approach is that the
array “travels” through the set of computations
of the algorithm in the right order
until  it  “covers”  all  the  computations. 
Another approach attempts to restate the
problem to be solved so that the problem
becomes a collection of smaller problems
that are similar to the original one and that
can be solved by the given systolic array.
While this second approach has less generality
and is harder to automate than the
first approach, it may have better performance
when it is applicable.
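
A minimal sketch of the first partitioning approach is given below for a
matrix-vector product. It assumes a hypothetical fixed-size array that can
handle blocks of at most p rows and p columns; the function
`small_array_matvec` merely stands in for that hardware and is not part of any
system described here. The outer loops play the role of the array "traveling"
through the computations until all of them are covered.

```python
def small_array_matvec(block, x_part):
    """Stand-in for a fixed-size array: handles at most p x p blocks."""
    return [sum(a * b for a, b in zip(row, x_part)) for row in block]


def partitioned_matvec(A, x, p):
    """Compute y = A x using only p x p (or smaller) sub-problems."""
    n, m = len(A), len(x)
    y = [0] * n
    for r0 in range(0, n, p):            # block rows, visited in order
        for c0 in range(0, m, p):        # block columns, visited in order
            block = [row[c0:c0 + p] for row in A[r0:r0 + p]]
            partial = small_array_matvec(block, x[c0:c0 + p])
            # Partial results of one block row are accumulated outside
            # the array, which is the extra buffering partitioning costs.
            for k, value in enumerate(partial):
                y[r0 + k] += value
    return y


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(partitioned_matvec(A, [1, 0, -1], p=2))
```

The input vector segments are read once per block row, which illustrates why
partitioned execution tends to increase input/output traffic.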

Automated design tools. The processing
elements  and  module  libraries  play  an 
important  role  in  making  the  process  of 


designing  special-purpose  arrays  of 
processing  elements  faster  and  more  cost- 
effective.  In  addition  to  the  many  existing 
tools for designing VLSI and WSI systems
that  can  be  readily  used  in  this  process,  the 
regularity  and  algorithmic  nature  of  sys¬ 
tolic  arrays  permits  the  use  of  high-level 
silicon compilers. At this time, the development
process is not fully automated; the
process  will  depend  on  future  progress  in 
design  automation  and  computer-aided 
design tools.

Universal building blocks. Systolic
arrays  cost  less  to  implement  than  other 
arrays  because  of  their  extensive  replica¬ 
tion  of  a  small  number  of  simple,  basic 
modules  and  because  of  their  highly  dense 
and efficient layouts. It is worthwhile for
the simple building blocks to be carefully
designed  and  optimized,  since  the  costs 
involved  are  amortized  over  a  large  num¬ 
ber  of  replicated  circuits.  The  modular 
design  of  systolic  arrays  allows  designers 
who  want  rapid  prototyping  of  their  ideas 
to  use  off-the-shelf  devices,  such  as 
microprocessors, floating-point arithmetic
units,  and  memory  chips.  However,  these 
parts  may  not  be  designed  for  implement¬ 
ing  systolic  arrays  and  may  therefore  be 
inadequate  to  meet  the  design  require¬ 
ments.  This  has  led  to  the  development  of 
“universal building blocks”: chips that
can  be  used  for  many  systolic  arrays.  The 
cost  of  such  development  is,  therefore, 
amortized  over  replicated  modules  in 
many  arrays  rather  than  concentrated  in 
simply one array. Commercially available
chips that are worthy of consideration as
basic modules include the INMOS Transputer,
the TI TMS32010 and TMS32020,
the NEC dataflow chip µPD7281, Analog
Devices' ADSP2100, the Fujitsu MB8764,
and the National LM32900. Problems
involved in the use of programmable
building blocks include developing programming
tools to aid designers and
providing support for flexible interconnections.

Although systolic arrays provide extensive
throughput,  their  integration  into  existing 
systems  may  be  nontrivial  because  of  the 
extensive input/output bandwidth
involved, especially when a problem has to
be partitioned and input data have to be
accessed repeatedly. Additional problems
that have to be solved for systems with a
large number of systolic arrays include the
interconnections with the host, the memory
subsystem to support the systolic
arrays, the buffering and access of data to
meet the special input/output data distributions,
and the multiplexing and demultiplexing
of data when there are
insufficient input/output ports. The problems
that must be faced are exemplified by
Mosaic, a project being carried out at
ESL. The system consists of a statically
scheduled crossbar switch that connects
multiple Warp processors, each with local
memory modules, into a macropipeline.
The  local  memory  modules  are  used  to 
store  input  data  and  restructure  them  into 
the  required  input  format. 


The  future 

By  the  year  2000,  it  will  be  possible  to 
build  integrated  circuits  with  one  billion 
transistors, more than one thousand
times the number of devices available in
today’s densest integrated circuits.
These incredibly large circuits will use
0.1-micron geometries made possible by
advanced  optical,  electron-beam,  ion- 
beam,  or  X-ray  lithography.  While  the 
high  cost  of  setting  up  integrated-circuit 
factories that can handle these technologies
will certainly impact the initial cost per
chip,  the  main  manufacturing  limitations 
will  be  in  the  design,  verification,  testing, 
and  packaging  of  such  large  circuits.  In 
addition,  the  percentage  of  the  chip  area 
dedicated  to  interconnections  could 
increase  to  more  than  80  percent.  Systolic 
arrays will take advantage of submicron
technologies  without  suffering  from  the 
problems  just  mentioned,  since  they  are 
modular,  have  regular  interconnections, 
and  are  extensible.  By  the  year  2000, 
mature  design  and  programming  tools  and 
extensive  knowledge  of  suitable  applica¬ 
tions  and  algorithms  will  probably  render 
systolic  arrays  the  architecture  of  choice 
for  submicron  circuits  designed  for  digital 
signal  processing,  fast  arithmetic,  sym¬ 
bolic  processing,  and  intelligent  databases. 

Systolic  arrays  have  triggered  extensive 
related  work  and  research  in  the  areas  of 
processor-array architecture, algorithm
design and analysis, and parallel programming.
These areas are often identified as
systolic architecture, systolic algorithms,
and systolic computing, respectively. As a
consequence,  the  principles  behind  systolic 
arrays  have  gained  an  enlarged  scope.  That 
is, systolic architectures are not necessarily
arrays of processors; systolic
algorithms  may  be  very  complex  and  may 
not  necessarily  be  executed  in  simple 
processing  elements;  and  systolic  comput¬ 
ing  can  take  place  in  computers  other  than 
systolic  architectures.  The  prominent  fea¬ 
tures  of  systolic  arrays  are  the  processing 
elements,  which  implement  processes,  and 
the  regular  interconnection  of  multiple 
processing  elements.  The  processing  ele¬ 
ments  and  the  interconnection  of  process¬ 
ing  elements  can  be  implemented  in 
software,  general-purpose  microproces¬ 
sors,  or  specialized  hardware.  Because  of 
this  variety  of  implementation  possibili¬ 
ties,  systolic  arrays  have,  since  the  late 
seventies,  evolved  to  become  cellular  com¬ 
puting  at  the  algorithmic,  programming, 
architectural,  and  hardware  levels.  We 
are,  therefore,  witnessing  a  trend  in  which 
systolic  computing  is  becoming  a  pervasive 
form of multiprocessing.

ABSTRACT

A "state model" is proposed for solving the problem of routing and rerouting
messages in the Inverse Augmented Data Manipulator (IADM) network.
Using this model, necessary and sufficient conditions for the reroutability of
messages are established, and then destination tag schemes are derived. These
schemes are simpler, more efficient and require less complex hardware than
previously proposed routing schemes. Two destination tag schemes are proposed.
For one of the schemes, rerouting is totally transparent to the sender of the
message and any blocked link of a given type can be avoided. Compared with
previous works that deal with the same type of blockage, the time × space
complexity is reduced from O(log N) to O(1). For the other scheme, rerouting is
possible for any type of link blockage. A universal rerouting algorithm is
constructed based on the second scheme, which finds a blockage-free path for any
combination of multiple blockages if there exists such a path, and indicates
absence of such a path if there exists none. In addition, the state model is used
to derive constructively a lower bound on the number of subgraphs which are
isomorphic to the Indirect Binary N-Cube network in the IADM network. This
knowledge can be used to characterize properties of the IADM networks and for
permutation routing in the IADM networks.

Index terms - cube network, data manipulator network, destination-tag routing,
fault tolerance, interconnection network, multiprocessor, parallel processing,
state model.

1. Introduction


This paper discusses novel and efficient techniques for routing and rerouting
messages in the Inverse Augmented Data Manipulator (IADM) network [9].
These results are based on a new approach, the "state model," which characterizes
and correlates the topologies of the IADM and Indirect Binary N-Cube
networks, and leads to efficient exploitation of the redundancy available in the
IADM network.

Considerable research has been dedicated to the design of multistage
interconnection networks for multiprocessor systems. The class of data manipulator
networks, introduced in [3], includes, among others, the Augmented Data
Manipulator (ADM) network [17], the IADM network [9] and the Gamma network
[13][14]. The IADM network and the ADM network differ only in that the input
side of one of them corresponds to the output side of the other and vice versa.
The Gamma and the IADM networks are topologically equivalent; however,
they use switches of different types. Each 3×3 crossbar switch used in the
Gamma network can connect simultaneously all three inputs to all three outputs
whereas each switch used in the IADM network can connect only one of its
three inputs to one or more of its three outputs. The main interest of this
paper is the study of the IADM network; both the one-to-one and permutation
routings are considered. The schemes proposed for routing and rerouting messages
in the IADM network are also applicable to the Gamma network.

Perhaps the most popular class of multistage networks is the multistage
cube-type networks such as the Indirect Binary N-Cube [15], Omega [6], Baseline
[20], Generalized Cube [18], STARAN Flip [2] and a special case of SW-Banyan
[4] networks. Among the main advantages of these networks are their very
efficient destination tag routing schemes, partitionability, O(N log2 N) cost and
ability to pass useful permutations [16]. Some results of this paper are based on
characteristics of the Indirect Binary N-Cube network (hereon referred to as the
ICube network). Since the cube-type networks mentioned above are all
topologically equivalent [16][17][20][21], the results in this paper are also relevant to any
of them.

The ICube network is composed of n = log2 N stages labeled from 0 to n-1.
Each stage consists of 2N connection links and N interchange boxes (switches).
The structure of the network is such that the two input links of an interchange box
at stage i differ only in the i-th bit of their labels; the upper links have a "0" in the i-th
bit and the lower links have a "1." Figure 1 illustrates an ICube network of size
N = 8 and two possible states of an interchange box, "straight" and "exchange."
Since this paper considers only one-to-one and permutation routing, broadcast
states are not shown.

The IADM network is composed of n stages labeled from 0 to n-1. Each
stage consists of 3N connection links and N switching elements. An extra
column of switches is appended at the end of the last stage as the output
switches and is referred to as stage n. Each switch j at stage i has three output
links to switches (j - 2^i) mod N, j and (j + 2^i) mod N of the succeeding
stage. Each switch selects one of its input links and connects it to one or more
output links. Figure 2 illustrates an IADM network of size N = 8.
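
The connection rule just stated can be written down directly. The sketch below
is only an illustration of that rule; the function name `iadm_successors` and
the choice to pass n (with N = 2^n) are assumptions made for this example.

```python
def iadm_successors(j, i, n):
    """Switches of stage i+1 reachable from switch j of stage i.

    Follows the rule in the text: (j - 2^i) mod N, j, (j + 2^i) mod N,
    with N = 2^n switches per stage and stages numbered 0 .. n-1.
    """
    if not 0 <= i < n:
        raise ValueError("stage index must satisfy 0 <= i < n")
    N = 1 << n
    if not 0 <= j < N:
        raise ValueError("switch label must satisfy 0 <= j < N")
    step = 1 << i
    return ((j - step) % N, j, (j + step) % N)


if __name__ == "__main__":
    n = 3  # N = 8, as in Figure 2
    for i in range(n):
        print("stage", i, [iadm_successors(j, i, n) for j in range(1 << n)])
```

Note that at the last stage (i = n-1) the two nonstraight links of a switch lead
to the same successor, since j - 2^(n-1) and j + 2^(n-1) are equal modulo N.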

In a multistage interconnection network, the path connecting the source of
a message to its destination is determined by a routing scheme that specifies the
switching state of each switch in the path. Routing schemes are considerably
simpler for the cube-type networks than for the data manipulator-type networks.
In cube-type networks, the interchange box at stage i needs to examine
the i-th bit of the binary representation of the destination address of an
incoming message. If the i-th bit is 0, then the upper output of the box is
taken. If the i-th bit is 1, the lower output of the box is taken. These schemes
are known as destination tag routing schemes [6] and are extremely efficient and
simple to implement. Unlike cube-type networks, in the IADM and other data
manipulator-type networks there are several paths between any source s and
destination d (s ≠ d) and each switching element has at least three switching
states. Previously proposed routing schemes [9][10][13] for the IADM network
can be thought of as distance tag schemes; that is, they require calculation of
the distance from source to destination in order to generate routing and rerouting
tags. The rerouting schemes in these works basically find an alternate
representation of the distance, which specifies an alternate routing path.
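
Destination tag routing in a cube-type network can be sketched in a few lines.
The example below assumes that links are labeled by n-bit integers, that bit i
has weight 2^i, and that taking the upper (lower) output at stage i forces bit i
of the link label to 0 (1), in line with the description of the ICube network
above. The function name is hypothetical.

```python
def icube_route(s, d, n):
    """Link labels visited when routing from input s to output d.

    At stage i the box inspects only bit i of the destination d and
    forces bit i of the current label to that value; no distance
    computation is needed.
    """
    N = 1 << n
    if not (0 <= s < N and 0 <= d < N):
        raise ValueError("labels must lie in [0, 2^n)")
    path = [s]
    current = s
    for i in range(n):
        bit = (d >> i) & 1
        current = (current & ~(1 << i)) | (bit << i)
        path.append(current)
    return path


if __name__ == "__main__":
    print(icube_route(5, 2, 3))
```

Because each stage fixes exactly one bit, the path is unique, which is the
reason cube-type networks by themselves offer no alternative route around a
blocked link.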

McMillen and Siegel [9] proposed three dynamic rerouting techniques for
the IADM network for avoiding faulty or blocked ±2^i (nonstraight) links. The
first and the second schemes require that switches be capable of performing
two’s complement and +2^i addition operations, respectively. The third scheme
requires one extra tag bit which is dynamically updated as the message
propagates toward the destination. In [10], the work of [9] was expanded, and a
single-stage look-ahead scheme was proposed to avoid certain types of straight
link faults. This improved scheme also requires two’s complement operations.

Parker and Raghavendra [13] used redundant number representation and
proposed an algorithm capable of finding all routing paths, which, effectively,
are the redundant number representations for the distance between the source
and the destination. Because of the complexity of the algorithm, the cost of
computation is prohibitively large so that it is infeasible to implement the
algorithm in order to achieve dynamic routing [19]. In addition, although the
algorithm can generate all routing tags for any distance, there is no specific work on
rerouting schemes in [13][14].
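
The correspondence between routing paths and redundant number representations
can be illustrated by brute force. The sketch below is not the algorithm of
[13]; it simply enumerates every choice of one digit from {-1, 0, +1} per stage
and keeps those whose weighted sum equals the distance (d - s) modulo N. The
function name and the exhaustive search are assumptions of this example.

```python
from itertools import product


def iadm_paths(s, d, n):
    """All signed-digit tags (c_0, ..., c_{n-1}) that route s to d.

    Digit c_i in {-1, 0, +1} selects the -2^i, straight, or +2^i link at
    stage i, so a tag is valid when sum(c_i * 2^i) == d - s (mod N).
    """
    N = 1 << n
    distance = (d - s) % N
    tags = []
    for digits in product((-1, 0, 1), repeat=n):
        total = sum(c * (1 << i) for i, c in enumerate(digits))
        if total % N == distance:
            tags.append(digits)
    return tags


if __name__ == "__main__":
    for tag in iadm_paths(0, 1, 3):
        print(tag)
```

The search inspects 3^n candidate tags, which hints at why exhaustive
enumeration is unattractive for dynamic routing in hardware.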

Lee and Lee [7] proposed signed bit difference tag and destination tag local
control algorithms for the ADM and IADM networks that require no computation
of the distance between the source and the destination. But their local
control algorithms can only find one routing path for each source and destination
pair. If the need for rerouting arises, they still resort to the distance tag
schemes to find alternate paths.

Past research has shown interesting relationships between data manipulator
and cube-type networks. For example, because it is possible to embed the
Generalized Cube network in the ADM network [1][17], the set of interconnections
implementable by the ADM network is a superset of that of the Generalized
Cube network. This fact and the existence of multiple paths between any
source s and destination d (s ≠ d) in the ADM network suggest that the ADM
network can be thought of as a fault-tolerant Generalized Cube network.
Analogously, the IADM network can be regarded as a fault-tolerant ICube
network¹. Since the permutations realizable by cube-type networks are well studied,
the identification of possible embeddings of the ICube network in the
IADM network can help characterize the permutation capabilities of this network.
A contribution to the precise understanding of these notions is made in
this paper; it consists of the identification of a large number of distinct
subgraphs of the IADM network that are isomorphic to the ICube network.

¹ While topologically equivalent, the ICube and Generalized Cube I/O ports are
addressed so that their inter-relationship is the same as that of the IADM and ADM
networks, i.e., the input and output sides are interchanged.

Section 2 of this paper introduces a state model to describe and correlate
topologies of the ICube network and the IADM network. Necessary and
sufficient conditions to perform rerouting in the IADM network are derived in
Section 3. In Section 4 two routing and rerouting schemes are proposed based
on the theory developed in Section 3, together with a discussion of their merits
and implementation considerations. A universal rerouting algorithm is proposed
in Section 5, which can deal with any combination of multiple link blockages.
A class of subgraphs in the IADM network that are isomorphic to the ICube
network is identified in Section 6, and it is shown how to reconfigure the
IADM network under certain link faults to pass the cube-admissible permutations.
Finally, Section 7 summarizes the results presented in this paper.

2. State Model Descriptions for the ICube and IADM Networks

Multistage  networks  can  be  modeled  as  graphs  by  treating  interchange 
boxes  (also  called  switching  elements)  and  links  of  the  network  as  nodes  and 
edges  of  the  graph,  respectively.  Another  equivalent  graph  model  [lUSj  results  if 
interchange  boxes  are  associated  with  edges,  and  links  with  nodes.  Both  models 
are exemplified in Figures 1 and 3 for the ICube network. The IADM network is
shown in Figure 2 according to the first model. The design of switches based on
both models is discussed in [11]. Clearly, the ICube network in Figure 3 can be
regarded as being a subgraph of the IADM network in Figure 2. Henceforth,
the second model is always assumed when referring to the ICube network (i.e.
Figure 3) and the first model is assumed when dealing with the IADM network.

With respect to these graph models, the nodes and the edges of the graph
refer to the switches and the links of the networks, respectively. The number of
switches at each stage of a network is denoted N and $n = \log_2 N$ refers to the
number of stages. The switches of each stage are labeled from 0 to N-1 from
the top to the bottom. Any integer j has a binary representation
$j_0 j_1 \cdots j_{n-1}$, where $j_{n-1}$ is the most significant bit and n denotes the number
of bits. The notation $j_{p/q}$ means the bits of j starting at $j_p$ and ending at $j_q$,
where $p \le q$. Bit $\bar{j}_i$ is the 1's complement of bit $j_i$. Throughout this paper, j and
j+a, where a is some constant, are reserved to represent labels of switches.
Also modulo N arithmetic is assumed, e.g. j+a implies (j+a) mod N. The
notation $j \in S_i$ is used to indicate that a switch j belongs to stage i, and
$(j \in S_i,\ (j+a) \in S_{i+1})$ is used to represent a link at stage i joining $j \in S_i$ and
$(j+a) \in S_{i+1}$. A sequence of switches of contiguous stages $(j \in S_p, \ldots, k \in S_q)$ is
used to represent a path from $j \in S_p$ to $k \in S_q$.

Notation and terminology required for the characterization of network
topologies and destination tag routing schemes are introduced next. A switch $j$
of stage $i$ is an even$_i$ switch if $j_i = 0$ and an odd$_i$ switch if $j_i = 1$. Figure 2
identifies even$_i$ and odd$_i$ switches at different stages of the IADM network of
size $N = 8$. Define the functions $\Delta C_i$ and $\overline{\Delta C}_i$ that represent connection links
at stage $i$ as

$$
\Delta C_i(j,t_i) =
\begin{cases}
0 & \text{if $j$ is an even$_i$ switch and $t_i = 0$, or if $j$ is an odd$_i$ switch and $t_i = 1$} \\
-2^i & \text{if $j$ is an odd$_i$ switch and $t_i = 0$} \\
+2^i & \text{if $j$ is an even$_i$ switch and $t_i = 1$}
\end{cases}
$$

$$
\overline{\Delta C}_i(j,t_i) = -\Delta C_i(j,t_i)
$$

Also, define the functions $C_i(j,t_i) = j + \Delta C_i(j,t_i)$ and
$\bar{C}_i(j,t_i) = j + \overline{\Delta C}_i(j,t_i)$. These definitions imply the following lemma of
fundamental importance to the results of this paper.

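Lemma 2.1

$$
(C_i(j,t_i))_{0/n-1} = j_{0/i-1}\, t_i\, j_{i+1/n-1}
$$

and

$$
(\bar{C}_i(j,t_i))_{0/n-1} = j_{0/i-1}\, t_i\, q_{i+1/n-1}
$$

for some value of $q_{i+1/n-1}$ which depends on $j$ and $t_i$.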

Proof: If $j$ is an even$_i$ switch and $t_i = 0$, then $C_i(j,t_i) = \bar{C}_i(j,t_i) = j$. If $j$ is
an odd$_i$ switch and $t_i = 1$, then $C_i(j,t_i) = \bar{C}_i(j,t_i) = j$. If $j$ is an odd$_i$ switch
and $t_i = 0$, then $C_i(j,t_i)$ results from subtracting 1 from $j_i$. Since $j$ is an odd$_i$
switch, $j_i = 1$, no borrow is generated and all remaining bits of $j$ are
unchanged; however, $\bar{C}_i(j,t_i)$ adds 1 to $j_i$, changing the $i$-th bit to 0 and
altering some of the bits in positions $i+1, \ldots, n-1$ due to carry propagation.
Similar reasoning applies when $j$ is an even$_i$ switch and $t_i = 1$. □
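Lemma 2.1 can be checked exhaustively for small networks. The following is a minimal sketch, assuming the definitions of $\Delta C_i$ and $\overline{\Delta C}_i$ given above with bit 0 as the least significant bit; the function names (`delta_c`, `c`, `c_bar`, `lemma_2_1_holds`) are hypothetical and chosen only for this illustration.

```python
def delta_c(i, j, t):
    """Offset used in state C by switch j of stage i for tag bit t."""
    j_i = (j >> i) & 1              # 0: j is an even_i switch, 1: j is an odd_i switch
    if j_i == t:
        return 0                    # straight link
    return -(1 << i) if j_i == 1 else (1 << i)


def c(i, j, t, n):
    """C_i(j, t) with modulo-N arithmetic, N = 2**n."""
    return (j + delta_c(i, j, t)) % (1 << n)


def c_bar(i, j, t, n):
    """C-bar_i(j, t): same magnitude as C_i, opposite sign."""
    return (j - delta_c(i, j, t)) % (1 << n)


def lemma_2_1_holds(n):
    """Check both claims of the lemma for every stage, switch and tag bit."""
    size = 1 << n
    for i in range(n):
        low_mask = (1 << (i + 1)) - 1           # selects bits 0..i
        high_mask = (size - 1) & ~low_mask      # selects bits i+1..n-1
        for j in range(size):
            for t in (0, 1):
                expected_low = (j & ((1 << i) - 1)) | (t << i)   # j_{0/i-1} t_i
                nxt = c(i, j, t, n)
                if nxt & low_mask != expected_low:
                    return False
                if nxt & high_mask != j & high_mask:             # upper bits untouched by C_i
                    return False
                if c_bar(i, j, t, n) & low_mask != expected_low:
                    return False
    return True
```

For example, `c(0, 1, 0, 3)` gives switch 0 while `c_bar(0, 1, 0, 3)` gives switch 2; both have bit 0 equal to the tag bit.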

The notation and terminology just introduced can now be used to describe
the networks of interest in this paper. The following description for a network
in terms of $\Delta C_i$, $\overline{\Delta C}_i$, $C_i$ and $\bar{C}_i$ is called the network state model.

The ICube network is composed of $n$ stages labeled from 0 to $n-1$. Each
stage consists of $2N$ links and $N$ switches. An extra column of switches is
appended at the end of the last stage as the output switches (Figure 3) and is
denoted $S_n$. A switch $j \in S_i$ is connected to switches $C_i(j,t_i) \in S_{i+1}$, for
$0 \le i \le n-1$, $0 \le j \le N-1$, and $t_i = 0$ or $t_i = 1$. When using destination
tags, switch $j \in S_i$ routes a message to switch $C_i(j,d_i) \in S_{i+1}$, where $d_i$ is the
$i$-th bit of the address of the message destination.

The IADM network is composed of $n$ stages labeled from 0 to $n-1$. Each
stage consists of a column of $N$ switches and $3N$ connection links. An extra
column of switches is appended at the end of the last stage as the output
switches and is denoted $S_n$. A switch $j \in S_i$ is connected to switches
$C_i(j,t_i) \in S_{i+1}$ and $\bar{C}_i(j,t_i) \in S_{i+1}$, for $0 \le i \le n-1$, $0 \le j \le N-1$, and
$t_i = 0$ or $t_i = 1$. In other words, three links connect a switch $j \in S_i$ to the
switches $(j-2^i)$, $j$ and $(j+2^i)$ at stage $i+1$. Sometimes $+2^i$ and $-2^i$ are used to
represent links $(j \in S_i,\ (j+2^i) \in S_{i+1})$ and $(j \in S_i,\ (j-2^i) \in S_{i+1})$, respectively.
The term straight link refers to link $(j \in S_i,\ j \in S_{i+1})$ and the term nonstraight
link refers to links $\pm 2^i$.

According to the model, two types of switches, even$_i$ and odd$_i$, are required
in the IADM and ICube networks. Figure 4 illustrates the connection links of a
pair of even$_i$ and odd$_i$ switches for an ICube and an IADM network of size
$N = 8$. The $\Delta C_i$ function describes the ICube connections. For the IADM
network, the connection links can be described by the union of the functions
$\Delta C_i$ and $\overline{\Delta C}_i$. In practice, even$_i$ and odd$_i$ switches can be identical and easily
programmed (at power-up or system configuration time) to behave differently.

There are two possible routing behaviors (or states) for each switch in an
IADM network. A switch is said to be in state $C$ if the routing is decided in
accordance with the function $C_i(j,t_i)$, and it is in state $\bar{C}$ if the function
$\bar{C}_i(j,t_i)$ applies. On the whole, the link on which a message is routed depends
on whether the switch is an even$_i$ or odd$_i$ switch, whether it is in state $C$ or
$\bar{C}$, and the value of tag bit $t_i$. Also, the term state of the network is used to
denote collectively the states of all switches in the network.

The  notion  of  switch  state  is  only  conceptual;  it  can  be  implemented  by 
designing  the  switches  with  actual  logic  states  as  well  as  by  using  tags  with  n 
added  bits  specifying  the  states  of  the  switches  on  the  routing  path.  In  Section 
4,  these  and  other  aspects  of  the  actual  implementation  of  the  proposed 
schemes  are  discussed  in  detail. 


3.  Theory  behind  the  State-Based  Destination  Tag  Routing  Schemes 

Based on the framework developed in Section 2, routing problems in the
IADM network are now examined. It is clear that when every switch in the
IADM network is in state $C$, the IADM network behaves like an ICube network
and, therefore, the destination address $d_{0/n-1}$ can be used as a routing tag, i.e.
$t_i = d_i$. More generally, the following theorem can be proven.

Theorem 3.1 Let $d = d_{0/n-1}$ be the destination in the IADM network to which
a message is to be sent. Then $t = d_{0/n-1}$ is the unique destination routing tag
to the destination $d$ regardless of the state of the IADM network.

Proof: Consider an arbitrary tag $f_{0/n-1}$ and assume that the IADM network is
in an arbitrary state. Let $t_{0/n-1} = f_{0/n-1}$. Then each switch $j \in S_i$ will route the
incoming message to either $C_i(j,f_i)$ or $\bar{C}_i(j,f_i)$. From Lemma 2.1, it can be
reasoned by induction that, at stage $i$, $(C_i(j,f_i))_{0/i} = (\bar{C}_i(j,f_i))_{0/i} = f_{0/i}$. At
the last stage, $(C_{n-1}(j,f_{n-1}))_{0/n-1} = (\bar{C}_{n-1}(j,f_{n-1}))_{0/n-1} = f_{0/n-1}$. Thus the
address of the destination of the message is the same as the routing tag. This
proves both the validity and the uniqueness of $d_{0/n-1}$ as a routing tag. □
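The validity part of Theorem 3.1 can be illustrated by brute force on a small network. The sketch below is only an illustration under stated assumptions: state $C$ is encoded as 0 and state $\bar{C}$ as 1 (an encoding chosen here), one state value per stage stands for the state of the switch the message actually visits at that stage, and the names `next_switch`, `route` and `theorem_3_1_holds` are hypothetical.

```python
from itertools import product


def next_switch(i, j, t, state, n):
    """Switch reached at stage i+1 from switch j of stage i with tag bit t."""
    j_i = (j >> i) & 1
    if j_i == t:                      # even_i with t=0, or odd_i with t=1
        return j                      # straight link in either state
    delta = (1 << i) if j_i == 0 else -(1 << i)
    if state == 1:
        delta = -delta                # state C-bar takes the oppositely signed link
    return (j + delta) % (1 << n)


def route(source, dest, states, n):
    """List of switch labels visited at stages 0..n, using dest as the tag."""
    j = source
    path = [j]
    for i in range(n):
        j = next_switch(i, j, (dest >> i) & 1, states[i], n)
        path.append(j)
    return path


def theorem_3_1_holds(n):
    """True if every source reaches every destination under every state choice."""
    size = 1 << n
    return all(
        route(s, d, states, n)[-1] == d
        for s in range(size)
        for d in range(size)
        for states in product((0, 1), repeat=n)
    )
```

With `n = 3`, `route(1, 0, (0, 0, 0), 3)` and `route(1, 0, (1, 1, 0), 3)` follow different paths but both end at switch 0.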

It is implicit in the reasoning underlying Theorem 3.1 that any link on a
given path results from the appropriate choice of the state of the corresponding
switch, i.e. the use of "link" $\Delta C_i(j,t_i)$ results from setting $j \in S_i$ to state $C$ and
the use of "link" $\overline{\Delta C}_i(j,t_i)$ results from setting $j \in S_i$ to state $\bar{C}$. Thus, given a
path to the destination $d$, there is at least one network state for which the use
of $d$ as the destination tag results in the routing of a message through that
path.

The implication of Theorem 3.1 is that the use of a state model for the
IADM network reduces the problem of finding alternate routing paths to that of
controlling the states of the switches in the network. Capitalizing on this idea,
the following theorems show how alternate routing paths can be found in order
to evade blockages in the network. A straight link blockage occurs if a straight
link on the routing path is faulty or busy. A nonstraight link blockage is defined
analogously. The third type of blockage, called double nonstraight link blockage,
occurs if both nonstraight output links of a switch in the routing path are
faulty or busy. A switch blockage occurs if the switch itself is busy or faulty. A
switch blockage has the same effect as blocking all of the switch's input links
and can be transformed into a link blockage problem accordingly. The
discussion on rerouting in this paper is concerned only with link blockages.

Theorem 3.2 In the IADM network, a change of the state of switch $j \in S_i$ results
in a different routing path to a destination $d$ if and only if a nonstraight output
link of $j$ is used on the original routing path to $d$. Moreover, the other
nonstraight output link of $j$ is used on the new path.

Proof: Changing the state of $j$ implies that the "link" $\overline{\Delta C}_i(j,t_i)$ is used instead
of $\Delta C_i(j,t_i)$ or vice versa. However, if $\Delta C_i(j,t_i) = 0$ then $\overline{\Delta C}_i(j,t_i) = 0$ (i.e.
both use a straight link) and vice versa. □

With regard to the rerouting schemes proposed in this paper, the
implications of Theorem 3.2 are twofold. First, the "if" part of the theorem implies
that dynamic rerouting for a nonstraight link blockage can be achieved by
changing the state of the switch whose output is the nonstraight link, which is
equivalent to rerouting the message through the oppositely signed nonstraight
link connected to the same switch. Thus, the same subset of destinations is
reachable from the two switches whose input links are the two oppositely signed
nonstraight links. Second, the "only if" part of the theorem implies that
dynamic rerouting for a straight link blockage is impossible. This is true in
general since every routing path in the IADM network can be the result of
setting the network to some state. Moreover, if a path from stage $i'$ to stage $i''$
consists of all straight links connecting $j \in S_l$ and $j \in S_{l+1}$, $i' \le l < i''$, then there
exist no alternate routing paths from $j \in S_{i'}$ to $j \in S_{i''}$, for otherwise there would
exist an alternate routing path branching from $j \in S_{i'}$ and ending at the
destination. The only resort, if any at all, to bypass the straight link blockage is to
backtrack to a switch connected to a nonstraight link on the routing path at
some preceding stage and to reroute from that switch. It remains to show that
an alternate routing path always exists, provided that such a nonstraight link
exists. In fact, the existence of an alternate routing path partly results from
Theorem 3.2, as stated in the next theorem. Figure 5 illustrates the situation in
Theorem 3.3.

Theorem 3.3 Consider a routing path in the IADM network to a destination $d$
that contains a blocked straight link at stage $i$. There exists at least one
network state which results in an alternate routing path that avoids the same
straight link blockage at stage $i$ if and only if the original routing path to $d$
contains a nonstraight link at stage $i-k$ for some $k$, $i \ge k > 0$.

Proof: See Appendix A1. □
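The proof is deferred to the appendix, but the statement of Theorem 3.3 can be checked by enumeration for small $N$. The following sketch assumes the same switch model as Section 2, encodes states $C$ and $\bar{C}$ as 0 and 1 (a choice made here), reads the range of $k$ as $i \ge k > 0$, and treats two paths as different when their sequences of switch labels differ. All function names are hypothetical.

```python
from itertools import product


def next_switch(i, j, t, state, n):
    """Switch reached at stage i+1 from switch j of stage i with tag bit t."""
    j_i = (j >> i) & 1
    if j_i == t:
        return j                                  # straight link
    delta = (1 << i) if j_i == 0 else -(1 << i)
    if state == 1:
        delta = -delta                            # state C-bar: opposite sign
    return (j + delta) % (1 << n)


def all_paths(source, dest, n):
    """Every path (as a tuple of labels) obtainable from some network state."""
    paths = set()
    for states in product((0, 1), repeat=n):
        j, path = source, [source]
        for i in range(n):
            j = next_switch(i, j, (dest >> i) & 1, states[i], n)
            path.append(j)
        paths.add(tuple(path))
    return paths


def theorem_3_3_holds(n):
    size = 1 << n
    for s in range(size):
        for d in range(size):
            paths = all_paths(s, d, n)
            for p in paths:
                for i in range(n):
                    if p[i] != p[i + 1]:
                        continue                  # the link at stage i is not straight
                    # a path avoids the blocked link unless it also goes p[i] -> p[i+1]
                    alternate = any((q[i], q[i + 1]) != (p[i], p[i + 1]) for q in paths)
                    earlier_nonstraight = any(p[l] != p[l + 1] for l in range(i))
                    if alternate != earlier_nonstraight:
                        return False
    return True
```

For instance, `all_paths(0, 0, 3)` contains only the all-straight path, so no rerouting is possible for that pair, whereas `all_paths(1, 0, 3)` contains three paths.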

Previous work [7][9][13] implies only the "if" part of the theorem, i.e. the
possibility of using the nonstraight link of opposite sign in order to reroute a
message in the case of a nonstraight link failure. However, the "only if" part of the
theorem also implies that, in addition, it is not possible to devise a new
rerouting scheme capable of avoiding a backtracking (or look-ahead) mechanism in
order to deal with straight link blockages.

From Theorem 3.2 (for a given source/destination pair), if the straight
output link of a switch is on some routing path, neither nonstraight output link of
the switch can be used for routing; if one of the nonstraight output links of
a switch is on some routing path, the other nonstraight link of the switch is also
on another routing path and the straight link of the switch cannot be used for
routing. So for a given switch, the output link blockages that affect paths from
a given source to a given destination can only be (a) a nonstraight link
blockage, (b) a straight link blockage or (c) the double nonstraight link blockage.¹
Theorem 3.2 can be used to avoid case (a), a nonstraight link blockage, and
Theorem 3.3 to avoid case (b), a straight link blockage. If case (c) occurs, then
Theorem 3.2 cannot be used to find a rerouting path. A backtracking scheme
proposed later in Corollary 4.2 based on Theorem 3.3 can be adapted to overcome
this type of blockage. The adapted backtracking scheme is based on Theorem 3.4,
which is illustrated in Figure 6.

Theorem 3.4 Consider a routing path in the IADM network to a destination $d$
that contains a switch at stage $i$ both of whose nonstraight output links are
blocked. There exists at least one network state which results in an alternate
routing path that avoids the same blocked nonstraight links at stage $i$ if and
only if the original routing path to $d$ contains a nonstraight link at stage $i-k$
for some $k$, $i \ge k > 0$.

Proof: See Appendix A1. □


¹ Physically it is possible to have any combination of blockages of the output links of a
given switch. However, the possible routing paths for a given source/destination pair
can be affected by either a straight link blockage or a double nonstraight link blockage
in a given switch but never both types of blockage.


4.  State-Based  Routing  and  Rerouting  Schemes 


In this section, routing and rerouting schemes are discussed based on the
theory developed in Section 3. As mentioned earlier, the novelty of the ideas in
this paper lies in the state model of the routing behavior of each switch. In
previously proposed approaches, routing is determined solely by tag bits.
According to the state model, the switching action of each network element is
conceptually determined by its relative position (i.e. an even$_i$ or odd$_i$ switch), its state
(i.e. $C$ or $\bar{C}$) and a destination tag bit (i.e. 0 or 1) (Figure 4). This conceptual
separation of routing information makes it possible to devise the simple routing
schemes described in this section.

In the first scheme, each switch is initially set up to behave as an odd$_i$ or
even$_i$ switch. In addition, each switch can dynamically be set to one of the
logical states $C$ or $\bar{C}$. In other words, this scheme corresponds to a direct
implementation of the conceptual view of switch states. Destination tags are used
and, according to Theorem 3.1, the state of the network is transparent to the
sender of the message since it only affects the path of the message and not its
destination. Consequently, rerouting is also transparent in the sense that it
results from a change in the network state. In practice, the implementation can
be such that, for instance, state $C$ (or $\bar{C}$) is used as the default state for each
switch in the IADM network and the switch regards the other nonstraight link
as a spare link for rerouting; if a nonstraight blockage is detected, then the
switch changes state to $\bar{C}$ (or $C$) so that the spare link is used instead. This
scheme is called the Self-Repairing State-Based Destination Tag (SSDT) scheme.

Rerouting is useful not only when one nonstraight link in a switch is faulty
or busy, but also if both nonstraight links are busy. For example, when
considering a packet switching environment, rerouting may be desirable as a means of
balancing the message load throughout the network. The scheme proposed here
is well suited for this purpose. Assume that each nonstraight link has an
associated buffer (queue). When both nonstraight links are busy due to message
traffic congestion, a switch can choose which nonstraight buffer to assign a
message to (i.e. which state to associate with that queued message), based on the
number of messages present in the buffers in order to evenly distribute the
message load to the nonstraight links.
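The buffer-based choice described above can be sketched as follows. This is only an illustration: the paper does not specify a selection policy beyond evening out the load, so the shortest-queue rule, the tie-break in favor of state $C$, the 0/1 encoding of the states and the name `choose_state` are all assumptions made here.

```python
def choose_state(queue_c, queue_c_bar):
    """Pick the state for a message that must leave on a nonstraight link.

    queue_c and queue_c_bar are the buffers of the nonstraight links used in
    state C and state C-bar, respectively. Returns 0 for state C, 1 for C-bar.
    """
    # Assumed policy: join the shorter queue; ties go to state C.
    return 0 if len(queue_c) <= len(queue_c_bar) else 1


def enqueue(message, queue_c, queue_c_bar):
    """Append the message to the chosen buffer and report the chosen state."""
    state = choose_state(queue_c, queue_c_bar)
    (queue_c if state == 0 else queue_c_bar).append(message)
    return state
```

Because of Theorem 3.2, either choice still delivers the message to the same destination; only the path differs.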

The proposed SSDT scheme has the advantages that it uses simple $n$-bit
destination tags and is capable of rerouting messages when blockages occur in
nonstraight links. In addition, rerouting of a message is transparent to its
sender since the path of the message is determined by the state of the network.
For a given destination tag, the routing behavior of each switch on a possible
path is determined by the state of the switch, i.e. the SSDT scheme is fully
distributed and rerouting is done dynamically. Each switch requires a negligible
amount of extra hardware for the detection of blocked links and the
representation of two possible states.

The second scheme is called the Two-Bit State-Based Destination Tag
(TSDT) scheme and it uses $2n$-bit routing tags, which specify both the
destination of the message and the states of switches on the corresponding path. The
TSDT scheme has the advantage that rerouting is possible when blockages
occur for straight as well as nonstraight links.

As with the first scheme, the TSDT scheme assumes that each switch is
appropriately initialized to behave as an odd$_i$ or even$_i$ switch. Each "digit" of
the routing tag is represented by two bits $b_{n+i}$ and $b_i$, called the state bit and
the destination bit, respectively. For this scheme, the state of a switch of stage
$i$ is specified by $b_{n+i}$: if $b_{n+i} = 0$, the switch is in state $C$ and if $b_{n+i} = 1$, the
switch is in state $\bar{C}$. For all $i$, $0 \le i \le n-1$, $b_i = d_i$. In general, if $j$ is an
even$_i$ switch, $b_i b_{n+i} = 00$ and $b_i b_{n+i} = 01$ direct the message through a straight
link, $b_i b_{n+i} = 10$ through link $+2^i$ and $b_i b_{n+i} = 11$ through link $-2^i$; if $j$ is an
odd$_i$ switch, $b_i b_{n+i} = 10$ and $b_i b_{n+i} = 11$ direct the message through a straight
link, $b_i b_{n+i} = 01$ through link $+2^i$ and $b_i b_{n+i} = 00$ through link $-2^i$. In general,
given a switch, the destination bit specifies use of a straight link or a
nonstraight link while the state bit determines the choice of the positive or the
negative link (if the chosen link is a nonstraight link). Since state information
is carried by the routing tag, switches are not required to determine and
remember their own states, i.e. the design of the switches does not need to
implement the logic states $C$ and $\bar{C}$.
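The digit-to-link mapping just described can be written as a small decision function. The sketch below is an illustration under the state model of Section 2, with bit 0 taken as the least significant bit of the switch label; the name `tsdt_link` and the convention of returning a signed offset are assumptions made here.

```python
def tsdt_link(i, j, dest_bit, state_bit):
    """Signed offset selected at switch j of stage i by digit (b_i, b_{n+i}).

    Returns 0 for the straight link, +2**i or -2**i for a nonstraight link.
    """
    j_i = (j >> i) & 1                  # 0: even_i switch, 1: odd_i switch
    if j_i == dest_bit:
        return 0                        # destination bit already matches: straight link
    offset = (1 << i) if j_i == 0 else -(1 << i)   # link used in state C
    return -offset if state_bit == 1 else offset   # state C-bar flips the sign
```

For an even$_1$ switch such as $j = 0$, the digits 00, 01, 10 and 11 give 0, 0, +2 and -2; for an odd$_1$ switch such as $j = 2$, the digits 10, 11, 01 and 00 give 0, 0, +2 and -2, in agreement with the text.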

From  Theorem  3.2,  a  nonstraight  link  blockage  at  stage  i  can  be  bypassed 
conveniently  by  complementing  the  i-th  state  bit  while  the  destination  bits 
remain  unchanged.  For  convenience  of  reference,  this  is  restated  in  terms  of  the 
TSDT  scheme  as  Corollary  4.1  below. 

Corollary 4.1 Let $b_{n/2n-1}$ and $b'_{n/2n-1}$ be the state bits of the routing tag and
the rerouting tag, respectively, for the IADM network. In order to bypass a
nonstraight link blockage at stage $i$, state bit $b_{n+i}$ needs to be changed to
$\bar{b}_{n+i}$. That is, $b'_{n/2n-1} = b_{n/n+i-1}\, \bar{b}_{n+i}\, b_{n+i+1/2n-1}$.

Figure 7 illustrates an example of routing from $s = 1$ to $d = 0$ in an IADM
network of size $N = 8$. Let $b_{0/5} = 000000$ be the routing tag and $b'_{0/5}$ and
$b''_{0/5}$ denote the rerouting tags. The original tag $b_{0/5} = 000000$ specifies the path
$(1 \in S_0,\ 0 \in S_1,\ 0 \in S_2,\ 0 \in S_3)$. If $(1 \in S_0,\ 0 \in S_1)$ is blocked, the rerouting tag
$b'_{0/5} = 000100$ is obtained by complementing $b_3$, and link $(1 \in S_0,\ 2 \in S_1)$ is used
for rerouting. This tag specifies the path $(1 \in S_0,\ 2 \in S_1,\ 0 \in S_2,\ 0 \in S_3)$. If
$(2 \in S_1,\ 0 \in S_2)$ is also blocked, the rerouting tag $b''_{0/5} = 000110$ results from
complementing $b_4$, and link $(2 \in S_1,\ 4 \in S_2)$ is used for rerouting. This tag
specifies the path $(1 \in S_0,\ 2 \in S_1,\ 4 \in S_2,\ 0 \in S_3)$.
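The example can be traced mechanically. The following sketch assumes that a tag is written as the string $b_0 b_1 \cdots b_{2n-1}$, exactly as in the example, and that switches follow the TSDT rules of this section; `tsdt_path` and `flip_state_bit` are hypothetical helper names.

```python
def tsdt_path(source, tag, n):
    """Switch labels visited at stages 0..n for a 2n-bit TSDT tag string."""
    bits = [int(ch) for ch in tag]
    j = source
    path = [j]
    for i in range(n):
        dest_bit, state_bit = bits[i], bits[n + i]
        j_i = (j >> i) & 1
        if j_i != dest_bit:                       # a nonstraight link is needed
            offset = (1 << i) if j_i == 0 else -(1 << i)
            if state_bit == 1:
                offset = -offset                  # state C-bar: opposite sign
            j = (j + offset) % (1 << n)
        path.append(j)
    return path


def flip_state_bit(tag, i, n):
    """Corollary 4.1: complement state bit b_{n+i}, leave all other bits alone."""
    k = n + i
    flipped = "1" if tag[k] == "0" else "0"
    return tag[:k] + flipped + tag[k + 1:]
```

Starting from `"000000"` with source 1, flipping the state bit of stage 0 and then of stage 1 yields the tags `"000100"` and `"000110"` and the three paths listed in the example.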

As discussed in Section 3, a straight link blockage and a double nonstraight
link blockage cannot be overcome easily; implementing a backtracking (or
look-ahead) mechanism is a must in order to evade these types of blockages. Since
all links in the routing path from stage $i-k+1$ to stage $i$ consist of only
straight links, backtracking of at least $k$ stages is required to find the switch
from which an alternate routing path branches. That is, at least $k$ state bits
need to be considered for change. Due to the similarity between Theorems 3.3
and 3.4, the TSDT schemes for finding the rerouting paths from Theorems 3.3
and 3.4 are exactly the same, which is stated as Corollary 4.2.

Corollary 4.2 Let $b_{n/2n-1}$ and $b'_{n/2n-1}$ be the state bits of the routing tag and
the rerouting tag, respectively, for a source/destination pair in the IADM
network. Let $i-k$ be the largest stage number for $i \ge k > 0$ such that a switch at
stage $i-k$ is connected to a nonstraight link on the routing path. In order to
bypass a straight link blockage or a double nonstraight link blockage at stage $i$,
only state bits $b_{n+(i-k)/n+i-1}$ need to be changed: (i)
$b'_{n+(i-k)/n+i-1} = \bar{b}_{n+i-k}\, \bar{d}_{i-k+1/i-1}$ if the nonstraight link at stage $i-k$ of the
original path is link $-2^{i-k}$, and (ii) $b'_{n+(i-k)/n+i-1} = \bar{b}_{n+i-k}\, d_{i-k+1/i-1}$ if the
nonstraight link at stage $i-k$ of the original path is link $+2^{i-k}$. The state bits
$b'_{n+i/2n-1}$ have arbitrary values in both cases.

Proof: See Appendix A1. □

The example in Figure 7 can be used to illustrate the TSDT scheme for (a)
a straight link blockage and (b) a double nonstraight link blockage. (a) Again
the tag $b_{0/5} = 000000$ specifies a path $(1 \in S_0,\ 0 \in S_1,\ 0 \in S_2,\ 0 \in S_3)$. If the
straight link $(0 \in S_1,\ 0 \in S_2)$ is blocked, the rerouting tag $b'_{0/5}$ can be 000110, which
specifies path $(1 \in S_0,\ 2 \in S_1,\ 4 \in S_2,\ 0 \in S_3)$ by having
$b'_{3+0}\, b'_{3+1}\, b'_{3+2} = 110$, where $b'_{3+0} = \bar{b}_{3+0}$. Since state bits $b'_{3+1}\, b'_{3+2}$ are arbitrary,
000100, for example, is also a valid rerouting tag; it specifies path
$(1 \in S_0,\ 2 \in S_1,\ 0 \in S_2,\ 0 \in S_3)$. (b) Let the tag $b_{0/5} = 000110$ specify a path
$(1 \in S_0,\ 2 \in S_1,\ 4 \in S_2,\ 0 \in S_3)$. If both nonstraight output links of $4 \in S_2$ are
blocked, the rerouting tag $b'_{0/5}$ can be 000100, which specifies path
$(1 \in S_0,\ 2 \in S_1,\ 0 \in S_2,\ 0 \in S_3)$ by having $b'_{3+0}\, b'_{3+1}\, b'_{3+2} = b_{3+0}\, \bar{b}_{3+1}\, b_{3+2}$. Since state
bit $b'_{3+2}$ can be arbitrary, 000101 is also a valid rerouting tag, which
specifies the same path.
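A sender-side computation of the rerouting tag for a straight or double nonstraight link blockage might look like the sketch below. It is an illustration, not the paper's algorithm: it backtracks to the nearest earlier nonstraight link, complements that stage's state bit, and sets the state bits of the intermediate stages to $\bar{d}_l$ or $d_l$ depending on the sign of the original link, which is the reading of Corollary 4.2 assumed here. State bits from stage $i$ onward are left unchanged, one of the arbitrary choices the corollary allows. The names `tsdt_path` and `backtrack_tag` are hypothetical.

```python
def tsdt_path(source, tag, n):
    """Switch labels visited at stages 0..n for a 2n-bit TSDT tag string."""
    bits = [int(ch) for ch in tag]
    j, path = source, [source]
    for i in range(n):
        j_i = (j >> i) & 1
        if j_i != bits[i]:
            offset = (1 << i) if j_i == 0 else -(1 << i)
            if bits[n + i] == 1:
                offset = -offset
            j = (j + offset) % (1 << n)
        path.append(j)
    return path


def backtrack_tag(source, tag, i, n):
    """Rerouting tag for a blockage at stage i, or None if no earlier
    nonstraight link exists on the current path."""
    size = 1 << n
    path = tsdt_path(source, tag, n)
    bits = [int(ch) for ch in tag]
    for stage in range(i - 1, -1, -1):            # walk backward from stage i-1
        if path[stage] != path[stage + 1]:        # nearest nonstraight link found
            break
    else:
        return None
    # stage is at most n-2 here, so +2**stage and -2**stage differ modulo N
    went_negative = (path[stage] - (1 << stage)) % size == path[stage + 1]
    bits[n + stage] ^= 1                          # take the oppositely signed link
    for l in range(stage + 1, i):                 # keep moving away from the old path
        bits[n + l] = 1 - bits[l] if went_negative else bits[l]
    return "".join(str(b) for b in bits)
```

On the Figure 7 example, `backtrack_tag(1, "000000", 1, 3)` returns `"000100"` for case (a) and `backtrack_tag(1, "000110", 2, 3)` returns `"000100"` for case (b). A blockage at stage 2 of the original path, `backtrack_tag(1, "000000", 2, 3)`, needs two stages of backtracking and returns `"000110"`.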

The rerouting path computed from Corollary 4.2 is blockage-free from
stage 0 to stage $i$. While the rerouting path is different from the original
routing path from stage $i-k$ to stage $i$, the routing path from stage 0 to $i-k-1$
remains the same. This results from the fact that backtracking always
proceeds backward along the original path until it stops at stage $i-k$, and the
rerouting path only changes course from stage $i-k$ onwards. Although state
bits $b_{n+i/2n-1}$ remain unchanged, the routing path from stage $i$ to $n-1$ may
still be altered due to the changes from stage $i-k$ to $i$. For example, in Figure
5, the switch on the original routing path at stage $i+1$ is $j \in S_{i+1}$ whereas the
switch on the rerouting path at stage $i+1$ may be $(j+2^{i+1}) \in S_{i+1}$, which may
further induce changes at higher-order stages.

In the TSDT scheme, the tag can be computed by the message sender,
which is assumed to know the location of faulty links and switches in the
network. Thus, rerouting is transparent to the switches in the sense that the tag
computed by the sender of the message simply avoids the usage of faulty links
and switches. Therefore switches do not require any extra hardware for
rerouting purposes. An alternative is to implement dynamic rerouting for the TSDT
scheme. Since backtracking is indispensable for avoiding a straight link
blockage, it is required that each switch can detect the inaccessibility of any output
port (connected to a switch at the next stage) and signal the presence of the
blockage back to the switches of previous stages [10][12]. Whether rerouting is
done by the sender or dynamically is an implementation decision which depends
on how many stages of backtracking are allowed. When the sender computes
the tag, it must be able to identify and track the switches and links on the
corresponding routing and rerouting paths (the next paragraphs explain how
this is done). If any of the switches or links in the path is known to the sender
as being faulty, then the sender computes another tag by changing the state
bits as described in Section 5.

Locating the switches on the routing path is straightforward. For a given
source $s$ and a destination $d$, the initial routing path can be specified by setting
state bits $b_{n/2n-1} = 0_{n/2n-1}$ (a string of $n$ 0's), equivalent to setting every
switch in the IADM network to state $C$. Then every switch on the original
path has label $d_{0/i-1}\, s_{i/n-1} \in S_i$, $0 \le i \le n-1$, since now the IADM network
functions like an ICube network [6][15].
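Under the all-state-$C$ setting, Lemma 2.1 implies that the switch reached at stage $i$ carries the low-order bits $0$ to $i-1$ of the destination and the remaining bits of the source. The sketch below computes that label directly and compares it with a step-by-step simulation; the function names are hypothetical and bit 0 is taken as the least significant bit.

```python
def next_switch_state_c(i, j, t, n):
    """C_i(j, t): the switch reached in state C."""
    j_i = (j >> i) & 1
    if j_i == t:
        return j
    return (j + ((1 << i) if j_i == 0 else -(1 << i))) % (1 << n)


def initial_path(source, dest, n):
    """Path obtained when every state bit is 0 (every switch in state C)."""
    j, path = source, [source]
    for i in range(n):
        j = next_switch_state_c(i, j, (dest >> i) & 1, n)
        path.append(j)
    return path


def label_at_stage(source, dest, i, n):
    """Bits 0..i-1 taken from dest, bits i..n-1 taken from source."""
    low_mask = (1 << i) - 1
    return (dest & low_mask) | (source & ~low_mask & ((1 << n) - 1))
```

For `source = 1`, `dest = 0` and `n = 3`, the labels at stages 0 to 3 are 1, 0, 0, 0, matching the original path of Figure 7.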

To find the switches on the rerouting path, let $j \in S_i$ be the switch whose
output link is blocked. First consider the case where the blocked link is a
nonstraight link. It may be (a) a positive or (b) a negative link. In case (a) the
switch at stage $i+1$ reached by the positive link is $(j+2^i) \in S_{i+1}$ and, from
Corollary 4.1, rerouting can be done through switch $(j-2^i) \in S_{i+1}$. In case (b) the
switch at stage $i+1$ reached by the negative link is $(j-2^i) \in S_{i+1}$ and, from
Corollary 4.1, rerouting can be done through switch $(j+2^i) \in S_{i+1}$. Let the switch
at stage $i+1$ on the rerouting path be $j' \in S_{i+1}$. The state bits $b'_{n+(i+1)/2n-1}$
remain intact (equal to 0's) because this corresponds to having every switch from
stage $i+1$ to $n-1$ remain in state $C$ so that the IADM network from stage $i+1$
to $n-1$ can emulate the ICube network from stage $i+1$ to $n-1$. Thus, the bits
$l$ to $n-1$, $i+1 \le l \le n-1$, of the label of a switch at stage $l$ on the rerouting
path are $j'_{l/n-1}$. From Lemma 2.1, bits 0 to $l-1$, $i+1 \le l \le n-1$, of the label of a
switch at stage $l$ on a path to destination $d$ must be $d_{0/l-1}$. Hence the switch on
the rerouting path from stage $i+1$ to $n-1$ has label $d_{0/l-1}\, j'_{l/n-1}$, $i+1 \le l \le n-1$.

Next consider the case where the blockage of $j \in S_i$ is a straight link
blockage or a double nonstraight link blockage so that backtracking is necessary.
There are two sub-cases for each type of blockage: (i) the nonstraight link
found in backtracking is a negative link and (ii) it is a positive link. Here only
sub-case (i) of the straight link blockage is considered; the other cases can be
dealt with similarly. From the proof of Corollary 4.2 (case (i) only), the switch
on the rerouting path is $(j+2^l) \in S_l$, $i-k < l \le i$. The switch of stage $i+1$ on
the rerouting path is $j \in S_{i+1}$ if $b'_{n+i} = 0$ and $(j+2^i) \in S_i$ is an odd$_i$ switch or if
$b'_{n+i} = 1$ and $(j+2^i) \in S_i$ is an even$_i$ switch, and is $(j+2^{i+1}) \in S_{i+1}$ if $b'_{n+i} = 0$ and
$(j+2^i) \in S_i$ is an even$_i$ switch or if $b'_{n+i} = 1$ and $(j+2^i) \in S_i$ is an odd$_i$ switch. The
identification of switches on the rerouting path from stage $i+1$ to $n-1$ is done
as in the case of a nonstraight link blockage described above.

The blocked link can be represented by the two switches joined by the link.
Since every switch on the original routing path and the rerouting paths can be
easily identified as described above, it can be readily determined whether or not
the blocked link is on the current path.

In summary, for both SDT schemes, the binary representation of the
destination address can be used directly as the routing tag. In the SSDT scheme,
rerouting tags are not needed and in the TSDT scheme, rerouting tags result
from simple bit complementing operations. In terms of complexity of the
computation for a rerouting tag, the SSDT scheme and the TSDT scheme for
one instance of nonstraight link blockage require time×space complexity $O(1)$,
an improvement over previously proposed schemes [9] dealing with rerouting for a
nonstraight link blockage that require time×space complexity $O(\log N)$. In [10]
a single-stage look-ahead scheme for rerouting of a straight link blockage was
proposed; it requires use of two's complement to compute the positive and
negative dominant tags so that the scheme has time×space complexity of $O(\log N)$.
Note that the single-stage look-ahead rerouting scheme is valid only for some
cases of the straight link blockage; it cannot be applied to any case of the
straight link blockage. From Corollary 4.2, $k$-stage backtracking is needed for
a straight link blockage and $k$ bits of the state bits need to be changed; thus
the complexity of the TSDT scheme for a straight link blockage is $O(k)$. If only
single-stage backtracking (corresponding to single-stage look-ahead) is necessary,
rerouting can be done dynamically and the complexity is $O(1)$, an improvement
over the scheme in [10].

5.  A  Universal  Rerouting  Algorithm  for  Multiple  Blockages 

The TSDT scheme can be applied not only to one instance of some
blockage, but can also be applied repetitively each time a new blockage is
encountered as the message propagates along. This section considers the derivation of
an algorithm to deal with any case of multiple blockages. The backtracking
schemes proposed in Corollary 4.2 find a rerouting path for a straight link
blockage and a double nonstraight link blockage. Nevertheless, it is possible
that blockages also exist on the rerouting path; then further backtracking to a
lower-order stage is needed. Since this phenomenon can recur, repeated
backtracking may be necessary due to blockages on the rerouting paths. The
algorithm BACKTRACK described next performs iterated backtracking to find
an alternate routing path. It underlies a universal rerouting algorithm (called
REROUTE) to be shown later that can find a routing path, if there exists any,
to bypass multiple blockages in the network.

The inputs to algorithm BACKTRACK are the current routing path P, the
stage number i where a blockage occurs, and state bits $b_{n/2n-1}$ representing
path P. The algorithm returns updated values of the state bits $b'_{n/2n-1}$, which
specify a rerouting path that is blockage-free from stage 0 to stage i if such a
rerouting path exists, or returns FAIL if the blockages on the current routing
path and the rerouting paths eliminate the possibility of communication
between the source and the destination. It is assumed that the blockage on the
original routing path at stage i is a straight link blockage or a double
nonstraight link blockage and $j \in S_i$ is the switch whose output links are the
blocked links. Informal explanations for the algorithm are given following
the algorithm, and the correctness proof of this algorithm can be found in
Appendix A2.

Algorithm BACKTRACK (and REROUTE) presumes knowledge of all blockages in
the network. The network controller is responsible for collecting this
information and maintaining a global map of blockages, which is accessible to
every sender of messages in order to compute a path that avoids the
blockages. In addition, since it may take several iterations before a
blockage-free path can be found or it can be concluded that no blockage-free
paths exist, the sender of the message needs to maintain and update the
locations of switches on the rerouting path in each iteration.

Algorithm BACKTRACK (P, i, $b_{n/2n-1}$)

0: q = stage number where a blockage occurs.
   P = the current routing path.

1: Backtrack on path P from stage q to find a nonstraight link. If no
   nonstraight link exists at any preceding stage, return(FAIL); otherwise
   assign to r the stage number where the first nonstraight output link is found.

2: If the nonstraight link at stage r on the routing path is $+2^r$, assign flag
   linkfound value 0; if it is $-2^r$, assign linkfound value 1.

3: If linkfound = 0, $b'_{n/2n-1} = b_{n/n+r-1}\, d_{r/q-1}\, b_{n+q/2n-1}$; if linkfound = 1,
   $b'_{n/2n-1} = b_{n/n+r-1}\, \bar{d}_{r/q-1}\, b_{n+q/2n-1}$.

4a: This step applies only when the blockage at stage q on path P is a straight
    link blockage.

    If linkfound = 0, set $b'_{n+q} = d_q$; if $((j-2^q) \in S_q,\ (j-2^{q+1}) \in S_{q+1})$ is
    blocked, change $b'_{n+q}$ to $\bar{d}_q$; furthermore, if $((j-2^q) \in S_q,\ j \in S_{q+1})$ is
    also blocked, return(FAIL). If linkfound = 1, set $b'_{n+q} = \bar{d}_q$; if
    $((j+2^q) \in S_q,\ (j+2^{q+1}) \in S_{q+1})$ is blocked, change $b'_{n+q}$ to $d_q$;
    furthermore, if $((j+2^q) \in S_q,\ j \in S_{q+1})$ is also blocked, return(FAIL).

4b: This step applies only when the blockage at stage q on path P is a double
    nonstraight link blockage.

    If $((j-2^q) \in S_q,\ (j-2^q) \in S_{q+1})$ is blocked for linkfound = 0, or
    $((j+2^q) \in S_q,\ (j+2^q) \in S_{q+1})$ is blocked for linkfound = 1, return(FAIL).

5: Let Q denote the part of the rerouting path (specified by the tag in step
   3) from stage r+1 to stage q.

   If linkfound = 0,
   $Q = ((j-2^{r+1}) \in S_{r+1},\ \ldots,\ (j-2^{q-1}) \in S_{q-1},\ (j-2^{q}) \in S_q)$;
   if linkfound = 1,
   $Q = ((j+2^{r+1}) \in S_{r+1},\ \ldots,\ (j+2^{q-1}) \in S_{q-1},\ (j+2^{q}) \in S_q)$.

   If a blockage occurs on path Q, return(FAIL).


6: If linkfound = 0 and $((j-2^r) \in S_r,\ (j-2^{r+1}) \in S_{r+1})$ is blocked, or if
   linkfound = 1 and $((j+2^r) \in S_r,\ (j+2^{r+1}) \in S_{r+1})$ is blocked, go to step 7;
   else return($b'_{n/2n-1}$).

7: $j \leftarrow j+2^r$, $q \leftarrow r$.

8: Backtrack on path P from stage q to find a nonstraight link. If no
   nonstraight link exists at any preceding stage, return(FAIL); otherwise
   assign to r the stage number where the first nonstraight output link is found.

9: If linkfound = 0 and the nonstraight link at stage r is $-2^r$, or if
   linkfound = 1 and the nonstraight link at stage r is $+2^r$, return(FAIL).

10: If linkfound = 0, $b'_{n/2n-1} = b'_{n/n+r-1}\, d_{r/q-1}\, b'_{n+q/2n-1}$; if linkfound = 1,
    $b'_{n/2n-1} = b'_{n/n+r-1}\, \bar{d}_{r/q-1}\, b'_{n+q/2n-1}$. Go to step 4b.


Step 0 is the initialization step. From Theorems 3.2 and 3.4, an alternate
path exists for avoiding a straight link blockage or a double nonstraight link
blockage if and only if there exists a nonstraight link at some stage preceding
stage q; step 1 of the algorithm searches backward for such a nonstraight link.
If none is found, the algorithm terminates prematurely, reflecting the fact
that no alternate paths for rerouting exist. Step 2 is used to differentiate
the cases when the nonstraight link at stage r found in the first backtracking
is a positive link and when it is a negative link; flag linkfound is assigned 0
for the former and 1 for the latter. If a nonstraight link exists at some stage
preceding the blockages, in step 3, Corollary 4.2 is applied to find the state
bits specifying the rerouting path; cases (i) and (ii) in Corollary 4.2
correspond to linkfound = 1 and linkfound = 0, respectively, and q and r
correspond to i and i-k, respectively.

Steps 4a and 4b deal with the link blockage at stage q on the rerouting
path computed in step 3. If the blockage of a switch at stage q on path P is a
straight link blockage, the possible rerouting links at stage q are two
nonstraight links. In step 4a the default link is a negative link if
linkfound = 0 and a positive link if linkfound = 1. If the default link is
blocked, step 4a attempts to reroute the message through the other nonstraight
link. If both nonstraight links are blocked, there exist no blockage-free
paths. Step 4b applies if the blockage of a switch at stage q on path P is a
double nonstraight link blockage. The rerouting path must use a straight link
at stage q. If it is also blocked, no blockage-free path exists.

Step 5 checks blockages from stage r+1 to stage q-1 on the rerouting
path; if any blockage falls on Q, there exists no blockage-free path. In step 6,
if the blockage falls on the link of stage r on the rerouting path, further
backtracking is necessary. Otherwise (no blockages on the rerouting path), the
algorithm terminates with the state bits specifying the rerouting path. Step 7
updates the stage number q and the switch label j where a blockage on the
rerouting path occurs, initiating a new iteration of backtracking. Step 8 is the
same as step 1, searching backward at lower-order stages again for a
nonstraight link. Step 9 of the algorithm dictates that if the nonstraight link
encountered in the first iteration of backtracking is a positive (or negative)
link, the nonstraight link found in each subsequent iteration of backtracking
must also be a positive (or negative) link; otherwise no blockage-free paths
exist. If the condition in step 9 is satisfied, step 10, which is the same as
step 3, computes a rerouting path. After the rerouting path is found, the
algorithm returns to step 4b to check for further blockages on the rerouting
path.
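
The tag manipulation of steps 3 and 10 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the implementation used in this work. The switch rule in `tsdt_route` assumes the following reading of the TSDT scheme: a switch j at stage l takes the straight link when $d_l = j_l$, and otherwise takes link $+2^l$ when the state bit of stage l equals $j_l$ and link $-2^l$ when it does not. The names `bit`, `tsdt_route` and `complement_window` are hypothetical, and the state bits are kept in a list indexed by stage.

```python
def bit(x, i):
    """Return bit i of the integer x."""
    return (x >> i) & 1


def tsdt_route(src, dest, state_bits, n):
    """Trace a message through an IADM network of size N = 2**n.

    Assumed switch rule: straight if d_l == j_l; otherwise +2**l when the
    state bit of stage l equals j_l, and -2**l when it does not.
    Returns (switch labels at stages 0..n, signed link taken at each stage).
    """
    size = 1 << n
    j, path, links = src, [src], []
    for l in range(n):
        if bit(dest, l) == bit(j, l):
            step = 0
        elif state_bits[l] == bit(j, l):
            step = 1 << l
        else:
            step = -(1 << l)
        links.append(step)
        j = (j + step) % size
        path.append(j)
    return path, links


def complement_window(state_bits, dest, r, q, linkfound):
    """Rewrite the state bits of stages r..q-1 (assumed form of step 3):
    d_l for linkfound == 0 (negative links), its complement for linkfound == 1."""
    new_bits = list(state_bits)
    for l in range(r, q):
        d = bit(dest, l)
        new_bits[l] = d if linkfound == 0 else 1 - d
    return new_bits


# N = 16: the original path uses link -2**0 at stage 0, then straight links.
n, src, dest = 4, 1, 0
old_bits = [0, 0, 0, 0]
old_path, old_links = tsdt_route(src, dest, old_bits, n)

# Suppose the straight link of switch 0 at stage q = 2 is blocked. Backtracking
# finds the nonstraight link -2**0 at stage r = 0, so linkfound = 1.
new_bits = complement_window(old_bits, dest, r=0, q=2, linkfound=1)
new_path, new_links = tsdt_route(src, dest, new_bits, n)
print(old_path, old_links)
print(new_path, new_links)
```

In this example the rerouted message uses positive links at stages 0 and 1, arrives at switch $4 = 0 + 2^2$ at stage 2 instead of switch 0, and still reaches destination 0.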


For each source/destination pair, a link on some routing path for the
source/destination pair is called a participating link. As a direct result of
Theorem 3.2, the set of participating output links of a switch is composed of
either its straight output link or both of its nonstraight output links, but
never all of them. So the output link blockages of a switch, for a given
source/destination pair, can only be a straight link blockage, a nonstraight
link blockage, or a double nonstraight link blockage. Algorithm BACKTRACK
deals with the first and third kinds of blockage, and the second kind of
blockage can be overcome by applying Corollary 4.1. Algorithm BACKTRACK and
Corollary 4.1 can be used to form a universal algorithm capable of rerouting
messages when multiple blockages exist in the IADM network. This algorithm,
called REROUTE, returns state bits $b'_{n/2n-1}$ specifying a blockage-free
rerouting path if one exists, or returns FAIL otherwise.

Algorithm REROUTE (P, $b_{n/2n-1}$)

0: P = the original routing path.
   $b_{n/2n-1}$ = the routing tag specifying the original routing path.
   $b'_{n/2n-1}$ = the rerouting tag specifying the rerouting path.
   $b'_{n/2n-1} \leftarrow b_{n/2n-1}$.

1: Let i be the smallest stage number such that there exists a blockage at
   stage i on path P. If no blockages occur on path P, return($b'_{n/2n-1}$).

2: If the blockage at stage i on path P is a nonstraight link blockage and the
   other nonstraight link is not blocked, apply Corollary 4.1 to find state bits
   $b'_{n/2n-1}$ and go to step 4.

3: $b'_{n/2n-1} \leftarrow$ BACKTRACK(P, i, $b'_{n/2n-1}$).

4: Q = the rerouting path specified by state bits $b'_{n/2n-1}$.
   $P \leftarrow Q$ and go to step 1.

Step 0 is the initialization step. At the end of each iteration, a
blockage-free path from stage 0 to stage i is found. Then a new iteration
starts and i is given a new value in order to find a path avoiding the
blockages at a higher-order stage. The only terminating conditions for
algorithm REROUTE are a return of FAIL from step 3, indicating that no
blockage-free paths exist, and the return from step 1, indicating that a
blockage-free path has been found. Algorithm REROUTE is executed iteratively to
evade blockages from lower-order to higher-order stages. The correctness of
this algorithm follows from the correctness of algorithm BACKTRACK and
Corollary 4.1.
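
The input/output behavior required of REROUTE (return a blockage-free path if one exists, FAIL otherwise) can be cross-checked on small networks with an exhaustive search. The sketch below is not algorithm REROUTE and does not use routing tags; it is a hypothetical reference oracle, `find_blockage_free_path`, that tries the straight, positive and negative links of each stage in turn. Blocked links are assumed to be given as triples (stage, switch, sign), with sign 0, +1 or -1 for the straight, $+2^i$ and $-2^i$ links.

```python
def find_blockage_free_path(src, dest, n, blocked):
    """Exhaustively search an IADM network of size N = 2**n.

    Returns the list of link signs (0, +1, -1), one per stage, of a
    blockage-free path from src to dest, or None if no such path exists.
    """
    size = 1 << n

    def search(stage, j):
        if stage == n:
            return [] if j == dest else None
        low_mask = (1 << (stage + 1)) - 1
        for sign in (0, 1, -1):
            if (stage, j, sign) in blocked:
                continue
            nxt = (j + sign * (1 << stage)) % size
            # Later stages cannot change bits 0..stage, so they must
            # already agree with the destination.
            if (nxt ^ dest) & low_mask:
                continue
            rest = search(stage + 1, nxt)
            if rest is not None:
                return [sign] + rest
        return None

    return search(0, src)


# N = 8, source 0, destination 1, with a growing set of blocked links.
print(find_blockage_free_path(0, 1, 3, set()))
print(find_blockage_free_path(0, 1, 3, {(0, 0, 1)}))
print(find_blockage_free_path(0, 1, 3, {(0, 0, 1), (1, 7, 1)}))
print(find_blockage_free_path(0, 1, 3, {(0, 0, 1), (0, 0, -1)}))
```

The search is exponential in the number of stages in the worst case, so it is useful only as a check on small examples, in contrast with the tag-based approach of this section.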


6. Permutation Routing and Cube Subgraphs of the IADM Network

The results discussed so far are a consequence of the existence of spare
nonstraight links in addition to the ICube network embedded in the IADM
network. This section pursues this issue further by showing that there exist
multiple distinct subgraphs in the IADM network, each called a cube subgraph,
that are isomorphic to the ICube network. Two cube subgraphs are considered to
be distinct if they differ in at least one link. As mentioned in the
introduction of this paper, the cube-type networks have been studied
extensively in the literature and shown to be topologically equivalent.
Together with results from these studies, the knowledge of how to identify
cube subgraphs can help the understanding of the capabilities of the IADM
network and be useful for permutation routing in the IADM network.

Since each switch can be in state $C$ or $\bar{C}$, there are as many as $2^{nN}$
($= N^N$) network states, although each does not necessarily generate a unique
permutation. Setting a switch to a certain state indicates that one of its
nonstraight output links can be used for routing (i.e. it is active) while the
other cannot. Thus, each network state can be associated with a subgraph of the
IADM network which contains only the active links. When all switches in the
IADM network are set to state $C$, the IADM network functions as an ICube
network; this network state corresponds to a cube subgraph. The constructive
derivation of a lower bound for the number of cube subgraphs of the IADM
network uses the two basic ideas discussed in the next paragraphs.


Since $+2^{n-1} \equiv -2^{n-1} \pmod{N}$, the state of each switch of stage n-1
is irrelevant in the sense that any switch at stage n-1 is always connected to
the same two switches at stage n. Consequently, given any cube subgraph, there
exist $(2^N - 1)$ subgraphs isomorphic to it which differ only in their choices
of the nonstraight link $+2^{n-1}$ or $-2^{n-1}$ at stage n-1. Therefore, the
total number of distinct cube subgraphs is given by the product of $2^N$ and
the number of distinct subgraphs of the IADM network from stage 0 to stage n-2
that are isomorphic to the same stages in the ICube network.

The  calculation  of  the  number  of  subgraphs  in  the  first  n  — 1  stages  uses  an 
idea  similar  to  that  proposed  in  j.Sj  for  reconfiguring  the  DR  network  so  that  it 
performs as a Generalized Cube network. All switches of the IADM network
are logically relabeled by adding a constant x, $0 \le x \le N-1$, to the
original labels, i.e. switch j becomes $\hat{j} = j + x$. By setting each switch
to be an $even_i$ or $odd_i$ switch according to its new label and having all
switches be in state $C$, a cube subgraph results for each relabeling. However,
of the N possible subgraphs, only $N/2$ are distinct as far as the first n-1
stages are concerned. This result is stated in Theorem 6.1. A graphical
interpretation of cube subgraph isomorphism for an IADM network of size N = 8
is illustrated in Figure 8. In Figure 8, each physical switch j acts as a
logical switch $\hat{j} = (j+1) \bmod 8$. The isomorphism to the ICube network
can be easily visualized by moving switch 7 to the top of each stage as shown
in the figure. Notice that setting some switch to state $C$ according to its
logical label may be equivalent to setting the switch to state $\bar{C}$
according to its original label. For instance, switch $0 \in S_0$ (logical
label 1) is set to state $\bar{C}$ in Figure 8.

Theorem 6.1. There exist at least $\frac{N}{2} 2^N$ distinct cube subgraphs in the
IADM network.

Proof: See Appendix A1. □

In order to reconfigure the IADM network to one of its cube subgraphs,
each switch of stage i, for $0 \le i \le n-2$, needs to know the i-th bit of its
logical label. This can be done by sending the same logical label to every
switch in the same row at system reconfiguration time. Each switch is set as
being an $odd_i$ or $even_i$ switch by examining the i-th bit of the logical
label. All switches operate in state $C$ according to their logical labels,
with the exception of those at stage n-1, for which different states
correspond to different subgraphs.
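
The relabeling argument can be checked numerically for small networks. The following Python sketch is an illustration only; `cube_subgraph_links` and `count_distinct` are hypothetical names. It assumes that a switch whose logical label has i-th bit 0 (an $even_i$ switch in state $C$) activates link $+2^i$, and one whose i-th bit is 1 (an $odd_i$ switch) activates link $-2^i$, in addition to the straight link. It builds the active link set of stages 0 to n-2 for every constant x and counts the distinct sets.

```python
def cube_subgraph_links(n, x, last_stage):
    """Active links of stages 0..last_stage when every physical switch j
    acts in state C according to its logical label (j + x) mod N."""
    size = 1 << n
    links = set()
    for i in range(last_stage + 1):
        for j in range(size):
            logical = (j + x) % size
            sign = 1 if (logical >> i) & 1 == 0 else -1
            links.add((i, j, 0))      # straight link
            links.add((i, j, sign))   # the active nonstraight link
    return frozenset(links)


def count_distinct(n):
    """Number of distinct link sets over stages 0..n-2 for x = 0..N-1."""
    size = 1 << n
    return len({cube_subgraph_links(n, x, n - 2) for x in range(size)})


for n in (3, 4, 5):
    print(n, count_distinct(n), (1 << n) // 2)
```

For each network size tried, the count equals N/2, in agreement with the statement that only N/2 of the N relabelings are distinct over the first n-1 stages.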

The results of this section can be used in different ways. One usage is in
characterizing a class of permutations performable by the IADM network.
Permutations passable by the ICube network are discussed in [15] and adaptable
from [6]. Thus, the IADM network can perform all of these permutations plus
the same set of permutations with a given x added to both the source and
destination labels, $0 \le x < N/2$. Another use of the results of this section
is that the IADM network can pass the permutations performable by the ICube
network when the ICube network embedded in the IADM network experiences
nonstraight link failures. This is done by incorporating a reconfiguration
function in the system that reassigns each switch j to (j+x) and reconfiguring
the IADM network to a corresponding cube subgraph which does not include the
faulty nonstraight links. In [21] it is shown that any of the cube-type networks
can pass the permutations performable by the others by incorporating
appropriate reconfiguration functions. By the same token, the IADM network with
a nonstraight link fault can also pass the permutations performable by the
cube-type networks by including these reconfiguration functions in the system.


7.  Concluding  Remarks 

One of the main contributions of this paper is the identification of
destination tag routing schemes for the IADM network. They are simpler and more
efficient than previously known approaches, thus requiring less complex
switches and reducing message communication delays due to routing overhead. In
the SSDT scheme rerouting can be done when nonstraight links fail and in the
TSDT scheme both the straight and double nonstraight link blockages can be
avoided. As for the SSDT scheme, routing and rerouting are transparent to the
source and only negligible hardware and time are used by each switch for
routing and rerouting purposes. These are considerable advantages over
previously proposed schemes which do not use destination tags and require extra
hardware or delays of O(log N) complexity instead of O(1). In addition,
previous works all deal only with certain types of blockages. Based on the TSDT
scheme, a universal rerouting algorithm is derived, which is capable of
avoiding any combination of multiple blockages if there exists a blockage-free
path and indicating absence of such a path if there exists none. The rerouting
capabilities of the new schemes can be readily used for fault-tolerance and
load balancing purposes since they adequately exploit the redundancy available
in the IADM network.

Another contribution of this paper is the constructive derivation of a lower
bound on the number of cube subgraphs of the IADM network. While it was
previously known that the ICube network is a subgraph of the IADM network,
this paper shows that there exist at least $\frac{N}{2} 2^N$ distinct cube
subgraphs. This, combined with previous multistage cube network studies, can
help characterize some of the permutations performable by the IADM network. As
another use of the subgraph analysis, it is shown how to reconfigure the IADM
network under nonstraight link faults to pass the cube-admissible permutations.

Perhaps the most fundamental contribution of this paper is that of the
network state model used for the IADM and the ICube networks. The essence of
this model is in the recognition that the routing action of each switch is
conceptually dependent on its position in the network (topological
information), its state (functional information), and the destination of the
message (routing information). Topological information is fixed and, when using
destination tags, the same can be said of routing information for a given
message destination. Consequently, the routing path is solely determined by the
state of the network. These basic concepts are applicable to networks other
than those considered in this paper; the state model can help devise new
designs, solve routing problems, and understand relationships among networks.

Appendix A1

Proof of Theorem 3.3

The "only if" part follows immediately from Theorem 3.2. To prove the "if"
part, let $j \in S_i$ be the switch whose straight output link is the blocked link
on the routing path and i-k be the largest stage number, for $i \ge k > 0$, such
that a switch at stage i-k has a nonstraight output link on the routing path.
(i) Assume that the nonstraight link at stage i-k found in backtracking is link
$-2^{i-k}$. Clearly, as illustrated in Figure 5, the path
$((j+2^{i-k}) \in S_{i-k},\ (j+2^{i-k+1}) \in S_{i-k+1},\ \ldots,\ (j+2^{i}) \in S_i)$
is a rerouting path for path
$((j+2^{i-k}) \in S_{i-k},\ j \in S_{i-k+1},\ \ldots,\ j \in S_i)$.
(ii) Assume that the nonstraight link at stage i-k found in backtracking is link
$+2^{i-k}$; similarly, path
$((j-2^{i-k}) \in S_{i-k},\ (j-2^{i-k+1}) \in S_{i-k+1},\ \ldots,\ (j-2^{i}) \in S_i)$
is a rerouting path for path
$((j-2^{i-k}) \in S_{i-k},\ j \in S_{i-k+1},\ \ldots,\ j \in S_i)$. □

Proof of Theorem 3.4

The "only if" part again follows immediately from Theorem 3.2. To prove the
"if" part, let notations i, i-k and $j \in S_i$ be the same as those in the proof
of Theorem 3.3. The proof is illustrated in Figure 6. From Theorem 2.2,
$(j-2^i) \in S_{i+1}$ and $(j+2^i) \in S_{i+1}$ can reach the same subset of
destinations so that it does not matter which is on the rerouting path.
(i) Assume that the nonstraight link at stage i-k found in backtracking is link
$-2^{i-k}$. It is self-explanatory that path
$((j+2^{i-k}) \in S_{i-k},\ (j+2^{i-k+1}) \in S_{i-k+1},\ \ldots,\ (j+2^{i}) \in S_i,\ (j+2^{i}) \in S_{i+1})$
is a rerouting path. (ii) Assume that the nonstraight link at stage i-k found in
backtracking is link $+2^{i-k}$. Similarly, path
$((j-2^{i-k}) \in S_{i-k},\ (j-2^{i-k+1}) \in S_{i-k+1},\ \ldots,\ (j-2^{i}) \in S_i,\ (j-2^{i}) \in S_{i+1})$
is a rerouting path. Note that the participating input link of $j \in S_i$ may be
a nonstraight link; however, this is just a special case for k = 1. □

Proof of Corollary 4.2

First two lemmas are presented, which are to be used to prove Corollary 4.2.

Lemma A1.1 In the TSDT scheme, the links $+2^l$ and $-2^l$ connected to a switch
$j \in S_l$ are specified by tag bits $b_l b_{n+l} = \bar{j}_l j_l$ and
$b_l b_{n+l} = \bar{j}_l \bar{j}_l$, respectively, and the straight link is
specified by $b_l b_{n+l} = j_l j_l$ or $b_l b_{n+l} = j_l \bar{j}_l$.

Proof: Follows immediately from the definition of the TSDT scheme. □

Lemma A1.2 (i) Let $j \in S_l$ and $(j+2^l) \in S_{l+1}$ be two switches joined by a
positive nonstraight link $+2^l$ and let them be on a path to the destination
$d_{0/n-1}$. In the TSDT scheme, the routing tag can be set to
$b_l b_{n+l} = d_l \bar{d}_l$ to control routing to send the message from
$j \in S_l$ to $(j+2^l) \in S_{l+1}$. (ii) Let $j \in S_l$ and $(j-2^l) \in S_{l+1}$ be
two switches joined by a negative nonstraight link $-2^l$ and let them be on a
path to the destination $d_{0/n-1}$. In the TSDT scheme, the routing tag can be
set to $b_l b_{n+l} = d_l d_l$ to control routing to send the message from
$j \in S_l$ to $(j-2^l) \in S_{l+1}$.

Proof: Only the proof for (i) is given; the proof for (ii) is similar. From
Lemma 2.1 and the proof of Theorem 3.1, the switch $j' (= j+2^l) \in S_{l+1}$ has
the label $j'_{0/n-1} = d_{0/l}\, j'_{l+1/n-1}$, where $j'_{l+1/n-1}$ depends on
the network state. So $j'_l = d_l$. Additionally, $j'_l = \bar{j}_l$ because
$j' = j+2^l$. Hence $\bar{j}_l = d_l$. By Lemma A1.1,
$b_l b_{n+l} = d_l \bar{d}_l$. □

Proof of Corollary 4.2:

Only proofs of (i) for (a) a straight link blockage and for (b) a double
nonstraight link blockage are given; proofs of (ii) for cases (a) and (b) are
similar. Since the destination bits always remain unchanged, only state bits
need to be considered. (a) This proof first derives the state bits controlling
the rerouting path
$Q = ((j+2^{i-k}) \in S_{i-k},\ (j+2^{i-k+1}) \in S_{i-k+1},\ \ldots,\ (j+2^{i}) \in S_i)$
in Figure 5 (which illustrates the proof of Theorem 3.3). Since the links on
path Q are all positive nonstraight links, by Lemma A1.2,
$b'_{n+i-k/n+i-1} = \bar{d}_{i-k/i-1}$ represents the state bits for path Q. In
addition, by Theorem 3.2, the link of stage i on the rerouting path can be
either link $-2^i$ $((j+2^i) \in S_i,\ j \in S_{i+1})$ or link $+2^i$
$((j+2^i) \in S_i,\ (j+2^{i+1}) \in S_{i+1})$. Thus $b'_{n+i}$ can be 0 or 1.
(b) Notice that the rerouting paths from stage i-k to stage i found in Theorem
3.3 and Theorem 3.4 are the same except that the link of stage i on the
rerouting path is a nonstraight link in Theorem 3.3 (Figure 5) and it is a
straight link in Theorem 3.4 (Figure 6). By Lemma A1.1, the state bit
$b'_{n+i}$, which specifies the straight link at stage i in Theorem 3.4, can be
0 or 1. So the state bits specifying the rerouting path from stage i-k to stage
i are the same as those in (a). $b'_{n+(i+1)/2n-1}$ is arbitrary because,
regardless of the values of $b'_{n+(i+1)/2n-1}$, as long as the destination bits
are $b_{0/n-1} = d_{0/n-1}$, the path can reach the destination $d_{0/n-1}$. □


Proof of Theorem 6.1

Consider two cube subgraphs generated by adding x and y, respectively, to the
original labels of all switches of the IADM network. It is shown that
$x \bmod \frac{N}{2} \ne y \bmod \frac{N}{2}$ is a sufficient condition for these
subgraphs to be distinct in the sense that they differ in at least one link of
the first n-1 stages (it is also possible to show the necessity of this
condition). To prove that the subgraphs are distinct, it is shown that, given
the condition above, there exists some physical switch $j \in S_{n-2}$ such that
(j+x) and (j+y) differ in their (n-2)-th bit, i.e. the switch with logical label
(j+x) is an $even_{n-2}$ switch and the switch with logical label (j+y) is an
$odd_{n-2}$ switch, or vice versa. This implies that a different nonstraight
link is used and therefore the subgraphs are distinct. Let the h-th bit of
$x_{0/n-2}$ and $y_{0/n-2}$ be the highest order bit such that $x_h \ne y_h$,
i.e. $x_{h+1/n-2} = y_{h+1/n-2}$. Here $h \le n-2$ since only the topology of the
IADM network from stage 0 to stage n-2 is considered. Without loss of
generality, assume that $x_h = 0$ and $y_h = 1$ and let
$j_{0/n-2} = \underline{0}_{0/h-1}\, 1\, \bar{x}_{h+1/n-2}$
(where $\underline{0}_{0/h-1}$ is a string of h 0's). Then

$(j+x)_{0/n-2} = x_{0/h-1}\,(0+1)\,\underline{1}_{h+1/n-2} = x_{0/h-1}\,\underline{1}_{h/n-2}$

$(j+y)_{0/n-2} = y_{0/h-1}\,(1+1)\,\underline{1}_{h+1/n-2} = y_{0/h-1}\,\underline{0}_{h/n-2}$

differ in the value of their (n-2)-th bit. Therefore there exist $\frac{N}{2}$
distinct cube subgraphs when considering only the topology of the IADM network
from stage 0 to stage n-2. For each of these $\frac{N}{2}$ cube subgraphs, there
exist $2^N$ subgraphs of the IADM network which differ from it only in the
choice of the nonstraight links at stage n-1. Thus, the IADM network contains
at least $\frac{N}{2} 2^N$ distinct cube subgraphs. □
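
The witness switch constructed in the proof can be checked exhaustively for a small network. The Python sketch below is only an illustration; `witness_switch` is a hypothetical helper that builds j from x and y as in the proof (bits below h equal to 0, bit h equal to 1, bits h+1 to n-2 equal to the complement of the corresponding bits of the label whose h-th bit is 0).

```python
def witness_switch(x, y, n):
    """Return a switch label j such that (j + x) and (j + y) differ in bit
    n-2, or None when x and y agree modulo N/2 (N = 2**n)."""
    mask = (1 << (n - 1)) - 1          # selects bits 0..n-2
    xl, yl = x & mask, y & mask
    if xl == yl:
        return None
    h = (xl ^ yl).bit_length() - 1     # highest differing bit
    if (xl >> h) & 1:                  # without loss of generality x_h = 0
        xl, yl = yl, xl
    high = ((~xl & mask) >> (h + 1)) << (h + 1)  # complement of bits h+1..n-2
    return (1 << h) | high


n = 4
size = 1 << n
checked = 0
for x in range(size):
    for y in range(size):
        j = witness_switch(x, y, n)
        if j is None:
            continue
        bit_x = ((j + x) >> (n - 2)) & 1
        bit_y = ((j + y) >> (n - 2)) & 1
        assert bit_x != bit_y
        checked += 1
print(checked, "pairs checked")
```

Every pair with $x \bmod \frac{N}{2} \ne y \bmod \frac{N}{2}$ passes the check for N = 16, which matches the sufficiency argument above.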


Appendix A2: Proof of Algorithm BACKTRACK

Terminology and two lemmas are introduced first in order to lay the
ground for the verification of algorithm BACKTRACK. Given a source and a
destination, a switch on some routing path for the source/destination pair is
called a pivot. Conversely, by the definition of a pivot, a path in the IADM
network can reach the destination if and only if it passes through a pivot at
each stage. The set of pivots at each stage varies with the source/destination
pair and is characterized by the following lemma.


Lemma A2.1 Let k be the smallest stage number for which there exists a
nonstraight link on at least one routing path from a given source $s_{0/n-1}$ to
a given destination $d_{0/n-1}$ in the IADM network. For this
source/destination pair, there is exactly one pivot at stage $\tilde{k}$,
$0 \le \tilde{k} \le k$, and there exist exactly two pivots at stage $\tilde{k}$,
$k+1 \le \tilde{k} \le n-1$. The pivot at stage $\tilde{k}$ is

pivots  of  stage  k  are  ,  and  either  r^k"/n  ] 

Proof: By definition of k, the routing paths from stage 0 to k-1 consist of only
straight links. From Theorem 3.2, there exists a unique path from stage 0 to
stage k and, therefore, the set of pivots at stage $\tilde{k}$,
$0 \le \tilde{k} \le k$, consists of exactly one pivot. Existence of exactly two
pivots at stage $\tilde{k}$, $k+1 \le \tilde{k} \le n-1$, and that their distance
is $2^{\tilde{k}}$ follow immediately from the single theorem in [11]. Since the
IADM network functions like an ICube network when every switch in the IADM
network is set to state $C$, $d_{0/\tilde{k}-1}s_{\tilde{k}/n-1} \in S_{\tilde{k}}$,
$k < \tilde{k} \le n-1$, is on a routing path [6],[15]; the lemma follows. □

Lemma A2.1 captures a simple characteristic of routing in the IADM
network and, for each source/destination pair, it allows the discussion to
focus only on the behavior of the pivots at each stage. A pivot is unreachable
if all its participating input links (defined in Section 3) are blocked, and it
is closed if all its participating output links are blocked. A pivot of a
lower-order stage can be closed due to the closure of pivots at higher-order
stages. Likewise, a pivot of a higher-order stage can be unreachable due to
unreachability of pivots at lower-order stages. From the definition of a pivot,
an important lemma which identifies the causes for the absence of blockage-free
paths between a source/destination pair is stated as follows.


Lemma A2.2 In the IADM network, for a given source/destination pair, if all
pivots of some stage are closed or unreachable, there exist no blockage-free
paths for the source/destination pair. □

Lemmas A2.1 and A2.2 describe the behavior of the switches and the links
in the set of routing paths for each source/destination pair. These lemmas
make it possible to ignore switches other than pivots and links other than
participating links at each stage for a source/destination pair. These results
greatly reduce the complexity of rerouting in the IADM network.

The correctness proof for algorithm BACKTRACK consists of two parts. The first is that the path found by the algorithm is a valid path leading to the destination and capable of avoiding blockages in the network. The second is that algorithm BACKTRACK always finds a rerouting path if one exists; equivalently, algorithm BACKTRACK returns FAIL only if there exist no blockage-free paths. Proving these two parts requires examining the conditions that terminate algorithm BACKTRACK.

The rerouting path found by the algorithm can route the message to the destination because the destination bits of the rerouting tags are equal to the binary representation of the destination address. The rerouting path's ability to evade blockages is a natural consequence of Corollary 4.2, on which steps 3 and 10, the only steps in the algorithm that generate rerouting tags, are based. Notice that step 6 returns the rerouting tag if the rerouting path found from step 3 or 10 is blockage-free.

The steps that return FAIL are steps 1, 4a, 4b, 5, 8 and 9. Steps 1 and 8 return FAIL because no alternate routing paths exist. Steps 4a, 4b, 5 and 9 return FAIL because the communication between the source and the destination is broken due to the blockages in the network. So it is impossible for a blockage-free path to exist without algorithm BACKTRACK finding it and not returning FAIL. Validity of steps 1 and 8 was discussed. Therefore, the proof for the second part is complete if steps 4a, 4b, 5 and 9 are verified.

Proof of steps 4a and 4b

In the following discussion for steps 4a and 4b, only the case where linkfound = 1 is explored; the cases where linkfound = 0 can be treated analogously. In Figure 5 (linkfound = 1 and $q = i$), the blockage at stage $q$ on path $P$ is a straight link blockage and the link at stage $q$ on the rerouting path is
chosen  to  be  ((74-2“^  .i)  by  setting  b  =  ci^  (Lemma  A1.2). 

A blockage in the nonstraight link chosen at $(j+2^q) \in S_q$ can be overcome by rerouting the message through the other nonstraight output link of $(j+2^q) \in S_q$. This is done by complementing $b$. If that link is also blocked, links $(j \in S_q, j \in S_{q+1})$, $((j+2^q) \in S_q, j \in S_{q+1})$ and $((j+2^q) \in S_q, (j+2^{q+1}) \in S_{q+1})$ are all blocked, thus both pivots at stage $q$, $j \in S_q$ and $(j+2^q) \in S_q$, are closed. Hence no blockage-free paths exist. The above explains step 4a. In Figure 6 (linkfound = 1 and $q = i$) both nonstraight links of $j \in S_q$ on path $P$, $(j \in S_q, (j-2^q) \in S_{q+1})$ and $(j \in S_q, (j+2^q) \in S_{q+1})$, are blocked and thus pivot $j \in S_q$ is closed. If $((j+2^q) \in S_q, (j+2^q) \in S_{q+1})$ is also blocked, pivot $(j+2^q) \in S_q$ is also closed. Because both pivots of stage $q$, $j \in S_q$ and $(j+2^q) \in S_q$, are closed, there exist no blockage-free paths. This explains step 4b. □

The scope of the correctness proof for steps 6 and 10 is limited to the case where the first nonstraight link found in backtracking is $-2^r$ (linkfound = 1) and assumes that the blockage at stage $i$ is a double nonstraight link blockage. Discussions for the cases where link $+2^r$ is the first nonstraight link found in backtracking and where the blockage at stage $i$ is a straight link blockage can be treated analogously.


An interesting property regarding the behavior of the pivots at each iteration of backtracking is discussed here. This is to be used in the correctness proof for steps 5 and 9. The discussions are associated with Figures 5 and 6 for $q = i$ and $r = i-k$. Since the links on path $P$ from stage $r+1$ to $q-1$ are all straight links, by Theorem 3.2, there exist no alternate routing paths from $j \in S_{r+1}$ to $j \in S_q$. So the closure of $j \in S_q$ would effectively close every pivot $j \in S_l$, $r+1 \le l \le q-1$. Hence if $j \in S_q$ is closed, every $j \in S_l$, $r+1 \le l \le q$, is closed. Due to the closure of $j \in S_{r+1}$, $((j+2^r) \in S_r, j \in S_{r+1})$ is blocked. If $((j+2^r) \in S_r, (j+2^{r+1}) \in S_{r+1})$ is also blocked (step 6), both participating output links of $(j+2^r) \in S_r$ are blocked and thus $(j+2^r) \in S_r$ is closed. After $j$ and $q$ are updated in step 7 (i.e. $j \leftarrow j+2^r$, $q \leftarrow r$, so that $(j+2^r) \in S_r$ becomes $j \in S_q$), the same type of blockage recurs (i.e. both nonstraight output links of $j \in S_q$ are blocked and thus $j \in S_q$ is closed) as that which took place when the algorithm was first entered (i.e. $q = i$) and thus a new iteration of backtracking begins. For convenience of reference, the property described in this paragraph is formally restated as a lemma.

Lemma A2.3 In each iteration of backtracking in algorithm BACKTRACK, on path $P$ every pivot $j \in S_l$, $r+1 \le l \le q$, is closed; if $((j+2^r) \in S_r, (j+2^{r+1}) \in S_{r+1})$ is also blocked, $(j+2^r) \in S_r$ is also closed. □


Beginning from the second iteration of backtracking, the link of stage $q$ on the rerouting path is always a straight link, since the blockage at the onset of each iteration of backtracking is always that both nonstraight output links of $j \in S_q$ are blocked (Figure 6). Hence only step 4b is involved in checking the blockages of stage $q$ on the rerouting path. As a result, in Figures 5 and 6 (linkfound = 1), the links on path $P$ from stage $r$ to stage $i$ consist of only straight links and negative nonstraight links; correspondingly, the links on the rerouting path from stage $r$ to stage $i$ consist of only straight links and positive nonstraight links. Similarly, for linkfound = 0, the links on path $P$ from stage $r$ to stage $i-1$ consist of only straight links and positive nonstraight links; correspondingly, the links on the rerouting path from stage $r$ to stage $i-1$ consist of only straight links and negative nonstraight links.

Proof  of  step  5 

Proof of step 5 is illustrated in Figure 6 for $q = i$ and $r = i-k$. Because of Lemma A2.2, it suffices to show that, in each iteration of backtracking, a double nonstraight link blockage at stage $q$ and an additional link blockage in $((j+2^l) \in S_l, (j+2^{l+1}) \in S_{l+1})$, for some $l$, $r+1 \le l \le q-1$, effectively close pivot $j \in S_{l+1}$ and make $(j+2^{l+1}) \in S_{l+1}$ unreachable. From Lemma A2.3, on path $P$ every pivot $j \in S_{l+1}$, $r+1 \le l \le q-1$, is closed. On the rerouting path, if a link blockage also occurs in $((j+2^l) \in S_l, (j+2^{l+1}) \in S_{l+1})$, pivot $(j+2^{l+1}) \in S_{l+1}$ becomes unreachable unless $j \in S_l$, the other pivot at stage $l$, is also connected to $(j+2^{l+1}) \in S_{l+1}$. This would occur only if link $+2^{l+1}$ is a legitimate link at stage $l$, i.e. $2^{l+1} = 2^n \equiv 0 \pmod{2^n}$ (a straight link). But $l < q \le n-1$ implies $l+1 \ne n$; here $q \le n-1$ since $q$ is the stage number at which an output link is blocked and stage $n-1$ is the last stage that has output links. Because pivot $j \in S_{l+1}$ is closed and $(j+2^{l+1}) \in S_{l+1}$ is unreachable, there exist no blockage-free paths. □

Proof of step 9

From Lemma A2.3, at the end of each iteration of backtracking, $(j+2^r) \in S_r$ and $j \in S_{r+1}$ are closed. After a new iteration of backtracking starts and step 7 is executed, $(j+2^r) \in S_r$ is relabeled as $j \in S_q$ and $j \in S_{r+1}$ is relabeled as $(j-2^q) \in S_{q+1}$. So the condition that $j \in S_q$ and $(j-2^q) \in S_{q+1}$ are both closed holds a priori at the beginning of the new iteration. Since $(j-2^q) \in S_{q+1}$, one of the pivots at stage $q+1$, is closed, any rerouting path must pass through $(j+2^q) \in S_{q+1}$, the other pivot at stage $q+1$. It is shown below that such a rerouting path does not exist if the nonstraight link at stage $r$ found in backtracking is $+2^r$. The proof is illustrated in Figure 9. The current routing path is $((j-2^r) \in S_r, j \in S_{r+1}, \ldots, j \in S_q, (j-2^q) \in S_{q+1})$ and there exists a rerouting path $((j-2^r) \in S_r, (j-2^{r+1}) \in S_{r+1}, \ldots, (j-2^q) \in S_q, (j-2^q) \in S_{q+1})$. Thus $(j-2^q) \in S_q$ and $j \in S_q$ are the two pivots at stage $q$. Since pivot $j \in S_q$ is closed, any rerouting path must pass through pivot $(j-2^q) \in S_q$. But $(j-2^q) \in S_q$ is not connected to $(j+2^q) \in S_{q+1}$ since link $+2^{q+1}$ is not a legitimate link at stage $q$, $0 \le q < n-1$. Therefore, no paths that pass through $(j-2^q) \in S_q$ and $(j+2^q) \in S_{q+1}$ exist. Note that although further backtracking to a still lower-order stage is possible, as long as the nonstraight link at stage $r$ found in backtracking is $+2^r$, the two pivots at stage $q$ never change. That is, further backtracking will not result in a path that passes through $(j-2^q) \in S_q$ and $(j+2^q) \in S_{q+1}$. □


Figure  1.  The  Indirect  Binary  N-Cube  (ICube)  network  for  N=8  (according  to 
the  first  graph  model);  two  possible  states  for  each  box  are  shown  (i.e. 
straight  and  exchange). 
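Figure 5. Rerouting for a straight link blockage in $(j \in S_i, j \in S_{i+1})$. Path $((j+2^{i-k}) \in S_{i-k}, j \in S_{i-k+1}, \ldots, j \in S_i, j \in S_{i+1})$ is a segment of the original path; $((j+2^{i-k}) \in S_{i-k}, (j+2^{i-k+1}) \in S_{i-k+1}, \ldots, (j+2^i) \in S_i, j \in S_{i+1})$ and $((j+2^{i-k}) \in S_{i-k}, (j+2^{i-k+1}) \in S_{i-k+1}, \ldots, (j+2^i) \in S_i, (j+2^{i+1}) \in S_{i+1})$ are the rerouting paths for it.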




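Figure 6. Rerouting for a double nonstraight link blockage in $(j \in S_i, (j-2^i) \in S_{i+1})$ and $(j \in S_i, (j+2^i) \in S_{i+1})$. Path $((j+2^{i-k}) \in S_{i-k}, (j+2^{i-k+1}) \in S_{i-k+1}, \ldots, (j+2^i) \in S_i, (j+2^i) \in S_{i+1})$ is a rerouting path for both paths $((j+2^{i-k}) \in S_{i-k}, j \in S_{i-k+1}, \ldots, j \in S_i, (j-2^i) \in S_{i+1})$ and $((j+2^{i-k}) \in S_{i-k}, j \in S_{i-k+1}, \ldots, j \in S_i, (j+2^i) \in S_{i+1})$.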


Figure 9. Rerouting for nonstraight output link blockages of switch $j \in S_q$. The original path is $((j-2^r) \in S_r, j \in S_{r+1}, \ldots, j \in S_q, (j-2^q) \in S_{q+1})$ and the nonstraight link found in backtracking is a positive link.
ABSTRACT


Augmented data manipulator networks are multistage interconnection networks which implement at each stage interconnection functions present in the single-stage network known as the PM2I network or barrel shifter. These multistage networks include the ADM (Augmented Data Manipulator) and IADM (Inverse Augmented Data Manipulator) networks, which have been extensively studied and proposed for use in multiprocessor systems. This paper derives new partially augmented networks based on the solution to the shortest path problem in the PM2I network. The new networks include: the HADM (Half Augmented Data Manipulator) and HIADM (Half Inverse Augmented Data Manipulator) networks, which have half the number of stages of the ADM and IADM networks; the MADM (Minimum Augmented Data Manipulator) and the MIADM (Minimum Inverse Augmented Data Manipulator) networks, which have the minimum link complexity required for one-to-one connections in a network of size $N$ with $\log_2 N$ stages of uniform switches; and the Extra Stage MADM and MIADM networks, which are fault-tolerant versions of the MADM and MIADM networks that can tolerate at least three switch failures. The derivations of these networks are presented and their properties and advantages over other designs are analyzed.
1. Introduction

Multistage interconnection networks are often designed by implementing at each stage interconnection functions characteristic of a single-stage network. This paper proposes new multistage networks which offer advantages over previously known designs based on the PM2I network [Sie77]. The new networks are derived from the solution to the shortest path problem in the PM2I network. Further analysis leads to the derivation of designs with minimal link complexity and fault-tolerance.

The plus-minus $2^i$ (PM2I) network [Sie77] is a single-stage network defined by the PM2I interconnection functions:

$$PM2I_{+i}(S) = (S + 2^i) \bmod N, \qquad 0 \le i \le n-1$$

$$PM2I_{-i}(S) = (S - 2^i) \bmod N, \qquad 0 \le i \le n-1$$

where $N = 2^n$ corresponds to the number of network nodes and $S$, $0 \le S \le N-1$, denotes a node address. Thus, in the PM2I network there exist links from a node $S$ to nodes $PM2I_{+i}(S)$, $0 \le i \le n-1$, as well as links to nodes $PM2I_{-i}(S)$, $0 \le i \le n-1$. These links are referred to as the $+2^i$ links and $-2^i$ links, respectively. A PM2I network of $N = 8$ nodes is illustrated in Figure 1.
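
The two interconnection functions can be stated directly as code. The following Python sketch is illustrative only; the names `pm2i_plus`, `pm2i_minus` and `pm2i_neighbors` are assumptions introduced here and are not taken from [Sie77].

```python
def pm2i_plus(s: int, i: int, n: int) -> int:
    """PM2I_{+i}(S) = (S + 2^i) mod N, with N = 2^n."""
    return (s + (1 << i)) % (1 << n)


def pm2i_minus(s: int, i: int, n: int) -> int:
    """PM2I_{-i}(S) = (S - 2^i) mod N, with N = 2^n."""
    return (s - (1 << i)) % (1 << n)


def pm2i_neighbors(s: int, n: int) -> list:
    """Sorted list of distinct nodes reachable from node s over one PM2I link."""
    reachable = set()
    for i in range(n):
        reachable.add(pm2i_plus(s, i, n))
        reachable.add(pm2i_minus(s, i, n))
    return sorted(reachable)
```

For $N = 8$, node 0 is linked to nodes 1, 2, 4, 6 and 7; the $+2^2$ and $-2^2$ links both lead to node 4, so there are five distinct neighbors rather than six.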

The class of data manipulator networks, introduced in [Fen74], is constructed based on the PM2I functions. It includes, among others, the Augmented Data Manipulator (ADM) network [SiS78], the IADM network [McS82] and the Gamma network [PaR82][PaR84]. The IADM network and the ADM network differ only in that the input side of one of them corresponds to the output side of the other and vice versa. The Gamma and the IADM networks are topologically equivalent; however, they use switches of different types. Each 3x3 crossbar switch used in the Gamma network can connect simultaneously all three inputs to all three outputs, whereas each switch used in the IADM network can connect only one of its three inputs to one or more of its three outputs.

The ADM network is composed of $n = \log_2 N$ stages labeled from 0 to $n-1$ from the output side to the input side. Each stage consists of $3N$ connection links and $N$ switches. The switches are labeled from 0 to $N-1$ from the top to the bottom. An extra column of switches is appended at the end of the last stage and is referred to as stage $n$. Each switch $j$ of stage $i+1$ has three output links to switches $(j-2^i) \bmod N$, $j$ and $(j+2^i) \bmod N$ of stage $i$. The link joining $j$ of stage $i+1$ and $j$ of stage $i$ is called a straight link, the link joining $(j-2^i) \bmod N$ of stage $i+1$ and $j$ of stage $i$ is a plus ($+2^i$) link [McS82], and the link joining $(j+2^i) \bmod N$ of stage $i+1$ and $j$ of stage $i$ is a minus ($-2^i$) link. Each switch selects one of its input links and connects it to one or more output links. Figure 2 illustrates an ADM network of size $N = 8$.
Figure 2. The ADM network for N = 8


Because the only difference between the ADM and IADM networks is that their input and output sides are reversed, the stages of the IADM network are labeled from 0 to $n-1$ from the input side to the output side. Each switch $j$ of stage $i$ in the IADM network is connected to switches $(j-2^i) \bmod N$, $j$ and $(j+2^i) \bmod N$ of stage $i+1$. A plus link in the IADM network from switch $j$ of stage $i$ to switch $j+2^i$ of stage $i+1$ is the same link as the minus link in the ADM network from switch $j+2^i$ of stage $i+1$ to switch $j$ of stage $i$. A similar relationship applies to a minus link in the IADM network and a plus link in the ADM network. Due to the reversal of the input and output sides of the ADM and IADM networks, stage $i$ of the ADM network corresponds to the switches of stage $i$ of the IADM network and the links of stage $i-1$ of the IADM network.
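
The correspondence between IADM and ADM links can be checked mechanically. The sketch below is a minimal illustration under the stage labeling described above; the function names `iadm_links` and `adm_links` and the dictionary keys are assumptions made for this example.

```python
def iadm_links(j: int, i: int, n: int) -> dict:
    """Output links of switch j at IADM stage i; targets are switches of stage i+1."""
    size, step = 1 << n, 1 << i
    return {"minus": (j - step) % size, "straight": j, "plus": (j + step) % size}


def adm_links(j: int, i: int, n: int) -> dict:
    """Output links of switch j at ADM stage i+1; targets are switches of stage i."""
    size, step = 1 << n, 1 << i
    return {"minus": (j - step) % size, "straight": j, "plus": (j + step) % size}


def links_correspond(n: int) -> bool:
    """True if every IADM plus (minus) link is an ADM minus (plus) link traversed backwards."""
    for i in range(n):
        for j in range(1 << n):
            up = iadm_links(j, i, n)
            if adm_links(up["plus"], i, n)["minus"] != j:
                return False
            if adm_links(up["minus"], i, n)["plus"] != j:
                return False
    return True
```

The two link functions are deliberately identical: the networks share the same wiring, and only the direction of traversal (stage $i$ to $i+1$ in the IADM network, stage $i+1$ to $i$ in the ADM network) differs.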

The results of this paper are based on the study of the shortest path problem in the PM2I network. The solution to the shortest path problem for the PM2I network is derived from an algorithm [PaR82] that generates routing tags for the Gamma network. Because the IADM and Gamma networks are topologically equivalent and the ADM and IADM networks differ only in their input and output sides, the results in this paper apply to all of these networks. However, the main interest of this paper is the study of the ADM network and the discussions are centered on the properties of the ADM network.

Given a string of $n$ digits, $t = t_0 t_1 \cdots t_{n-1}$, the notation $t_{p/q}$ denotes the digits of $t$ starting at $t_p$ and ending at $t_q$. Throughout this paper, $j$ and $j+a$ (where $a$ is some constant) represent labels of switches. Also, modulo $N$ arithmetic is assumed, e.g. $j+a$ implies $(j+a) \bmod N$. The notation $j^i$ is used to
indicate  that  a  switch  j  belongs  to  stage  i  and  (j''^'  i  /  )  is  used  to  represent  a 
link  joining  and  /.  A  sequence  of  switches  of  contiguous  stages 

•  ,  Z)  i»  '■o  represent  a  path  from  Z"^*  to  Z- 

Section  2  of  the  paper  considers  the  formulation  and  solution  of  the  shor¬ 
test  path  problem  for  the  PM2I  network.  In  Section  3  these  results  are  used  to 
derive  new  networks  that  require  less  hardware  complexity  and  transmission 
delay  than  other  known  augmented  data  manipulator  networks.  These  new 
networks are called partially augmented data manipulator networks. Details of
routing  schemes  for  these  networks  are  also  discussed  in  Section  3.  Fault- 
tolerant  topologies  are  proposed  in  Section  4  by  adding  an  extra  stage  to  these 
networks,  with  the  result  that  four  disjoint  paths  exist  between  any  source  and 
any  destination  in  the  networks.  Section  5  concludes  the  paper. 


2.  Shortest  Path  Problem  in  the  PM2I  Network 

Given a source node $S$ and a destination node $D$ in the PM2I network, the shortest path problem is to find a path from $S$ to $D$ which contains a minimal number of links. When circuit switching is used for communication between nodes, delays are identical for any link and transmission delay is directly proportional to the number of links on a path. Thus, the shortest path is also the one for which transmission delay is minimum.
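
As a baseline for the analysis that follows, the shortest path length can be obtained by a plain breadth-first search over the PM2I links. This is only a reference sketch, not the method developed in this paper; the function name `shortest_path_length` is an assumption made for illustration.

```python
from collections import deque


def shortest_path_length(src: int, dst: int, n: int) -> int:
    """Minimum number of PM2I links on a path from src to dst, with N = 2^n nodes."""
    size = 1 << n
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for i in range(n):
            # Each node has a +2^i and a -2^i link for every i.
            for nxt in ((node + (1 << i)) % size, (node - (1 << i)) % size):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
    raise ValueError("destination is not a valid node address")
```

For example, with $N = 8$, routing from node 1 to node 4 needs two links (for instance $-2^0$ followed by $+2^2$), while routing from node 0 to node 7 needs only the single link $-2^0$.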

Given a source node $S$ and destination node $D$ in the PM2I network, define distance $\Delta$ to be $(D - S) \bmod N$; thus the range of $\Delta$ is $0 \le \Delta \le N-1$. Routing from a source $S$ to a destination $D$ in the PM2I network can be characterized by the combination tag $t_{0/2n-1} = t_0 t_1 \cdots t_{2n-1}$ such that

$$\Delta = \left( \sum_{i=0}^{n-1} t_i 2^i + \sum_{i=n}^{2n-1} t_i (-2^{i-n}) \right) \bmod 2^n \qquad (1)$$

where $\Delta$ is the distance from the source $S$ to the destination $D$ and the $t_i$'s are nonnegative integers. A positive value of $t_i$ indicates that link $+2^i$, for $0 \le i \le n-1$, or link $-2^{i-n}$, for $n \le i \le 2n-1$, is used in the routing path, whereas $t_i = 0$ indicates that the link is not used. A combination tag, as suggested by its name, specifies a combination of PM2I links that can be used to cover the distance between the source and the destination. However, the combination tag $t_{0/2n-1}$ does not specify the sequence in which the links are used. Several distinct paths can be derived from a combination tag and all these paths contain the same number of links. Since the combination tag depends only on the distance $\Delta$, it is often identified as a combination tag of distance $\Delta$.
A shortest path is specified by a combination tag for which the number of links $\sum_{i=0}^{2n-1} t_i$ is minimum, and the problem of finding such a tag, called a minimum weight combination tag, can be stated as follows:

Problem (P) Find $t^* = t^*_{0/2n-1}$ such that

$$\sum_{i=0}^{2n-1} t_i^* = \min \sum_{i=0}^{2n-1} t_i$$

subject to

$$\Delta = \left( \sum_{i=0}^{n-1} t_i 2^i + \sum_{i=n}^{2n-1} t_i (-2^{i-n}) \right) \bmod 2^n$$

$$0 \le t_i, \qquad 0 \le i \le 2n-1$$

$$0 \le \Delta \le 2^n - 1$$

A feasible solution to this problem corresponds to a combination tag, and an optimal solution to it corresponds to a minimum weight combination tag. For convenience of discussion, the terms (i) a feasible solution and a combination tag, and (ii) an optimal solution and a minimum weight combination tag are used interchangeably.
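
Equation (1) and the weight of a combination tag can be evaluated directly. The sketch below assumes a tag is given as a sequence of $2n$ nonnegative integers, with positions $0$ to $n-1$ counting uses of the $+2^i$ links and positions $n$ to $2n-1$ counting uses of the $-2^{i-n}$ links; the function names are illustrative only.

```python
def tag_distance(tag, n: int) -> int:
    """Distance covered by a combination tag, following equation (1)."""
    if len(tag) != 2 * n:
        raise ValueError("a combination tag has 2n digits")
    plus = sum(tag[i] * (1 << i) for i in range(n))
    minus = sum(tag[i] * (1 << (i - n)) for i in range(n, 2 * n))
    return (plus - minus) % (1 << n)


def tag_weight(tag) -> int:
    """Number of links used by any path derived from the tag."""
    return sum(tag)
```

For $n = 3$ and $\Delta = 3$, the tags `(1, 1, 0, 0, 0, 0)` (links $+2^0$ and $+2^1$) and `(0, 0, 1, 1, 0, 0)` (links $+2^2$ and $-2^0$) are both feasible with weight 2, whereas `(3, 0, 0, 0, 0, 0)` is feasible but has weight 3.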

The next two lemmas reduce the size of the set of feasible solutions.

Lemma 2.1 If $t^*$ is the optimal solution to (P), then $t_i^* \in \{0, 1\}$, $0 \le i \le 2n-1$.

Proof: The proof is by contradiction. Assume that the optimal solution $t^*$ contains $t_k^* \ge 2$, for some $k$, $0 \le k \le n-2$ or $n \le k \le 2n-2$. Then there exist alternate paths that, compared with paths defined by $t^*$, reduce traversal through link $+2^k$ (or $-2^{k-n}$) twice and increase traversal through link $+2^{k+1}$ (or $-2^{k+1-n}$) once; i.e.

$$t_k^* 2^k + t_{k+1}^* 2^{k+1} = (t_k^* - 2) 2^k + (t_{k+1}^* + 1) 2^{k+1}$$

Compared with the total delay of the path defined by $t^*$, the total delay of the alternate paths is reduced by one, which contradicts the hypothesis that $t^*$ minimizes the routing delay. If $k = n-1$ (or $2n-1$) such that $t_{n-1}^* \ge 2$ (or $t_{2n-1}^* \ge 2$), a carry is generated in the highest order digit and $t_{n-1}^*$ is discounted by two, denoted $t'_{n-1} = t_{n-1}^* - 2$. The carry vanishes due to the (mod $2^n$) operation and the total delay of the alternate paths is reduced by two; again a contradiction results. □

Lemma 2.2 If $t^*$ is the optimal solution to (P), then $t_i^* t_{i+n}^* = 0$, $0 \le i \le n-1$; i.e. the shortest path between any source and any destination in the PM2I network cannot contain both link $+2^i$ and link $-2^i$ for any $i$, $0 \le i \le n-1$.

Proof: The proof is by contradiction. Suppose the opposite is true. From Lemma 2.1, a digit of the tag representing a shortest path can only be 0 or 1; by the assumption of having both $+2^i$ and $-2^i$ links on the routing path, $t_i^* = t_{i+n}^* = 1$. The effects of $+2^i$ and $-2^i$ cancel each other. Thus the values for $t_i^*$ and $t_{i+n}^*$ can be substituted by 0 and still satisfy the equality constraint in (P) (also equation (1)). The routing delay is thus reduced by two. A contradiction results. □
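
Both lemmas can be sanity-checked by exhaustive search on a small network. The sketch below is a brute-force check, not part of the proofs; it assumes that limiting each digit to at most 2 is enough to exhibit any violation for small $n$, and all function names are illustrative.

```python
from itertools import product


def tag_distance(tag, n: int) -> int:
    """Distance covered by a 2n-digit combination tag, following equation (1)."""
    plus = sum(tag[i] * (1 << i) for i in range(n))
    minus = sum(tag[i] * (1 << (i - n)) for i in range(n, 2 * n))
    return (plus - minus) % (1 << n)


def optimal_tags(delta: int, n: int, max_digit: int = 2):
    """Minimum weight and all minimum weight tags of distance delta (digits 0..max_digit)."""
    best, best_tags = None, []
    for tag in product(range(max_digit + 1), repeat=2 * n):
        if tag_distance(tag, n) != delta:
            continue
        weight = sum(tag)
        if best is None or weight < best:
            best, best_tags = weight, [tag]
        elif weight == best:
            best_tags.append(tag)
    return best, best_tags


def lemmas_hold(n: int) -> bool:
    """Check Lemma 2.1 (digits are 0 or 1) and Lemma 2.2 (t_i * t_{i+n} = 0) for every distance."""
    for delta in range(1 << n):
        _, tags = optimal_tags(delta, n)
        for tag in tags:
            if max(tag) > 1:
                return False
            if any(tag[i] * tag[i + n] for i in range(n)):
                return False
    return True
```

For $n = 3$ the search covers $3^6 = 729$ candidate tags per distance, so the check runs instantly; larger $n$ quickly becomes impractical for this naive enumeration.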

From Lemma 2.2, either $t_i^*$ or $t_{i+n}^*$ is zero, $0 \le i \le n-1$, so that the two sums $\sum_{i=0}^{n-1} t_i 2^i$ and $\sum_{i=n}^{2n-1} t_i (-2^{i-n})$ in equation (1) can be combined to form $\sum_{i=0}^{n-1} t_i 2^i$, with the extension of the values for $t_i$ to include negative integers. The result in Lemma 2.1 confines the values for each $t_i$ of a tag representing a shortest path to be 0 and 1. Together with the necessary extension to include negative integers, the possible values for $t_i$ of an optimal solution are $-1$, $0$ or $1$. Thus, the problem of finding a minimum weight combination tag can be reformulated as follows:

Problem (P) Find $t^* = t^*_{0/n-1}$ such that

$$\sum_{i=0}^{n-1} |t_i^*| = \min \sum_{i=0}^{n-1} |t_i|$$

subject to

$$\Delta = \left( \sum_{i=0}^{n-1} t_i 2^i \right) \bmod 2^n$$

$$t_i \in \{-1, 0, 1\} \quad \text{for } 0 \le i \le n-1$$

$$0 \le \Delta \le 2^n - 1$$

A branch-and-bound approach is used to find the optimal solution for (P), which is also a minimum weight combination tag. This approach is based on an algorithm proposed in [PaR82] that can find all signed-digit representations for the distance between any source and any destination in the Gamma network. Each signed-digit representation corresponds to a routing tag for the source/destination pair. Moreover, since the IADM network and the Gamma network are topologically equivalent, the routing tags generated by the algorithm are also valid routing tags for the IADM network. The Gamma network is constructed based on the PM2I functions and a routing tag uniquely specifies a path in it. In particular, each stage is composed of $2^n$ switches, and at each stage $i$, $0 \le i \le n-1$, each switch is connected to three output links, $+2^i$, $-2^i$ and the straight link, and only one of them is on the routing path; in addition, the path in the Gamma network traverses a distance of $(D - S) \bmod 2^n$ from $S$ to $D$. These correspond to the constraints in (P): $\Delta = \left( \sum_{i=0}^{n-1} t_i 2^i \right) \bmod 2^n$, $t_i \in \{\bar{1}, 0, 1\}$ and $-(2^n - 1) \le \Delta \le 2^n - 1$. Thus a routing tag that specifies a path from $S$ to $D$ in the Gamma network is also a feasible solution to (P). Note that $t_i = 0$ indicates that a straight link is used at stage $i$ for routing in the Gamma network.

A routing tag for the Gamma network can be converted to a combination tag for the PM2I network: if the $i$-th bit of the routing tag is $1$, $t_i = 1$; if it is $\bar{1}$, $t_{i+n} = 1$ (hereafter the signed-digit representation $\bar{1}$ [Avi61] is used to represent $-1$); and if it is $0$, $t_i = t_{i+n} = 0$. A combination tag satisfying conditions (a) $t_i \in \{0, 1\}$, $0 \le i \le 2n-1$, and (b) $t_i t_{i+n} = 0$, $0 \le i \le n-1$, can also be converted to a routing tag for the Gamma network: if $t_i = 1$, the $i$-th bit of the routing tag is $1$; if $t_{i+n} = 1$, the $i$-th bit of the routing tag is $\bar{1}$; and if $t_i = t_{i+n} = 0$, the $i$-th bit of the routing tag is $0$. The optimal solution to (P) certainly satisfies conditions (a) and (b) and is also a minimum weight tag. Because we are only interested in the shortest path (which can be characterized by a minimum weight combination tag) in the PM2I network, given the one-to-one correspondence between a minimum weight routing tag and a minimum weight combination tag, they are used interchangeably. The algorithm in [PaR82] is stated as follows.


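    $\Delta_0 = \Delta$
    for $i = 0$ to $n-1$ do
        if $\Delta_i$ is even then $t_i = 0$, $\Delta_{i+1} = \Delta_i / 2$
        else $t_i = 1$, $\Delta_{i+1} = (\Delta_i - 1)/2$ or $t_i = \bar{1}$, $\Delta_{i+1} = (\Delta_i + 1)/2$
        endif
    enddo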

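In the algorithm, $t_i$ is uniquely determined ($= 0$) if $\Delta_i$ is even, whereas freedom exists in choosing the value for $t_i$ ($1$ or $\bar{1}$) if $\Delta_i$ is odd. An example is shown that generates all tags for routing from $S = 1$ to $D = 4$ in the network of size $N = 8$. In this case, $\Delta = 3 = -5 \bmod 8$.

$$
\begin{array}{ccccccl}
[\Delta_0] & t_0 & [\Delta_1] & t_1 & [\Delta_2] & t_2 & (\Delta) \\
\hline
[3] & 1 & [1] & 1 & [0] & 0 & (= 3) \\
[3] & 1 & [1] & \bar{1} & [1] & 1 & (= 3) \\
[3] & 1 & [1] & \bar{1} & [1] & \bar{1} & (= -5 \bmod 8) \\
[3] & \bar{1} & [2] & 0 & [1] & 1 & (= 3) \\
[3] & \bar{1} & [2] & 0 & [1] & \bar{1} & (= -5 \bmod 8)
\end{array}
$$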
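
The tag generation can be reproduced with a short recursive enumeration. This is a sketch written for illustration, following the even/odd rule described above; the function name `all_tags` and the use of `-1` for the digit $\bar{1}$ are assumptions of this example, not notation from [PaR82].

```python
def all_tags(delta: int, n: int) -> list:
    """All n-digit signed-digit tags (t_0 first) produced by the even/odd rule for distance delta."""
    tags = []

    def extend(i: int, d: int, prefix: list) -> None:
        if i == n:
            tags.append(tuple(prefix))
            return
        if d % 2 == 0:
            # Even: the digit is forced to 0.
            extend(i + 1, d // 2, prefix + [0])
        else:
            # Odd: both 1 and -1 are admissible and lead to different tags.
            extend(i + 1, (d - 1) // 2, prefix + [1])
            extend(i + 1, (d + 1) // 2, prefix + [-1])

    extend(0, delta, [])
    return tags
```

Calling `all_tags(3, 3)` yields the five tags of the example in the same order, namely `(1, 1, 0)`, `(1, -1, 1)`, `(1, -1, -1)`, `(-1, 0, 1)` and `(-1, 0, -1)`, each of which sums to 3 modulo 8.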

As mentioned previously, the focus of this paper is the ADM network. The tags generated by algorithm ALL-TAGS can also be used in the ADM network because the ADM and IADM networks differ only in that the input side of one of them corresponds to the output side of the other and vice versa. In the IADM network routing is from a switch of the lowest order stage to the highest order stage, while routing in the ADM network is just the opposite. Therefore, the lowest order digit of the tag is first examined by a switch for routing in the IADM network and the highest order digit is first examined for routing in the ADM network. At stage $i$, $0 \le i \le n-1$, of both networks, if $t_i$ is $0$, the straight link is used for routing; if it is $1$, link $+2^i$ is used; if it is $\bar{1}$, link $-2^i$ is used. In particular, routing from $S^0$ to $D^n$ in the IADM network is equivalent to routing from $D^n$ to $S^0$ in the ADM network. Let $t_{0/n-1}$ be the tag for the routing from $S^0$ to $D^n$ in the IADM network; it can be readily verified that tag $t'_{0/n-1}$, where $t'_i = -t_i$, $0 \le i \le n-1$, can be used for routing from $D^n$ to $S^0$ in the ADM network. The two tags $t_{0/n-1}$ and $t'_{0/n-1}$ represent the same path, with different interpretations in the ADM and IADM networks. For example, the tags $110$, $1\bar{1}1$, $1\bar{1}\bar{1}$, $\bar{1}01$ and $\bar{1}0\bar{1}$ for the IADM network in the above table can be converted to $\bar{1}\bar{1}0$, $\bar{1}1\bar{1}$, $\bar{1}11$, $10\bar{1}$ and $101$ for the ADM network, respectively. Figure 3 illustrates the routing from $D^3 = 4$ to $S^0 = 1$ in the ADM network using these tags.
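
The tag conversion and the two routing directions can be simulated in a few lines. The following sketch assumes tags are tuples with $t_0$ first and `-1` standing for $\bar{1}$; the function names are hypothetical and chosen only for this illustration.

```python
def iadm_to_adm_tag(tag):
    """Negate every digit: t'_i = -t_i."""
    return tuple(-t for t in tag)


def route_iadm(src: int, tag, n: int) -> int:
    """Follow a tag through the IADM network, examining the lowest order digit first."""
    size, switch = 1 << n, src
    for i in range(n):
        switch = (switch + tag[i] * (1 << i)) % size
    return switch


def route_adm(src: int, tag, n: int) -> int:
    """Follow a tag through the ADM network, examining the highest order digit first."""
    size, switch = 1 << n, src
    for i in reversed(range(n)):
        switch = (switch + tag[i] * (1 << i)) % size
    return switch
```

With the five IADM tags of the example, `route_iadm(1, tag, 3)` ends at switch 4 in every case, and `route_adm(4, iadm_to_adm_tag(tag), 3)` returns to switch 1, which matches the claim that the two tags describe the same path traversed in opposite directions.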

The possibility of having two values, $1$ and $\bar{1}$, for $t_i$ if $\Delta_i$ is odd can be used to find the optimal solution to (P). It is shown below how to choose the value for $t_i$ so that $t_{i+1}$ can be pre-determined as desired.


Lemma 2.3 In the process of generating tags in algorithm ALL-TAGS, if $\Delta_i$ is odd, it is always possible to make $t_{i+1} = 0$ by properly choosing the value for $t_i$.

Proof: Since $(\Delta_i - 1)/2$ and $(\Delta_i + 1)/2$ differ exactly by one, one of them is even and the other is odd. Suppose that, without loss of generality, $(\Delta_i + 1)/2$ is even. Then $t_i$ can be chosen to be $-1$ so that $\Delta_{i+1} = (\Delta_i + 1)/2$, which makes $t_{i+1} = 0$. □


For example, one of the paths illustrated in Figure 3 is represented by a tag $t_{0/2}$ for distance $\Delta = \Delta_0 = 3$; in this case $t_0$ is chosen to be $\bar{1}$ so that $t_1 = 0$.

Theorem 2.4 There exists an optimal solution $t^*$ to (P) which has no adjacent nonzero digits; i.e., $t_i^* t_{i+1}^* = 0$ for $0 \le i \le n-2$. If $t_{n-1}^* = 0$ then $t^*$ is the unique optimal solution with no adjacent nonzero digits; otherwise, there exists another optimal solution $t'$ with no adjacent nonzero digits, where $t'_i = t_i^*$, $0 \le i \le n-2$, and $t'_{n-1} = -t_{n-1}^*$.

Proof: The proof consists of three parts. Part (i) finds a minimum weight tag, part (ii) proves the uniqueness of the minimum weight tag when $t_{n-1}^* = 0$, and part (iii) finds another minimum weight tag if $t_{n-1}^* \ne 0$.

(i)  An  algorithm  which  results  from  modifying  algorithm  ALL-TAGS  is  first 
given  to  construct  a  minimum  weight  tag;  it  is  followed  by  a  proof  of  its 
optimality. 

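Algorithm SHORTEST-PATH ($\Delta$, $t$)

    $\Delta_0 = \Delta$
    for $i = 0$ to $n-1$ do
        if $\Delta_i$ is even then $t_i = 0$, $\Delta_{i+1} = \Delta_i / 2$
        else if $(\Delta_i - 1)/2$ is even then $t_i = 1$, $\Delta_{i+1} = (\Delta_i - 1)/2$
            else $t_i = \bar{1}$, $\Delta_{i+1} = (\Delta_i + 1)/2$
            endif
        endif
    enddo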
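
A direct transcription of algorithm SHORTEST-PATH is given below as a sketch, together with a brute-force weight computation that can be used to cross-check it on small networks. The function names, the tuple representation of tags and the use of `-1` for $\bar{1}$ are assumptions made for this illustration.

```python
from itertools import product


def shortest_path_tag(delta: int, n: int) -> tuple:
    """Tag (t_0 first) built by choosing, at every odd step, the digit that makes the next one 0."""
    tag, d = [], delta
    for _ in range(n):
        if d % 2 == 0:
            tag.append(0)
            d //= 2
        elif ((d - 1) // 2) % 2 == 0:
            tag.append(1)
            d = (d - 1) // 2
        else:
            tag.append(-1)
            d = (d + 1) // 2
    return tuple(tag)


def tag_value(tag, n: int) -> int:
    """Distance represented by a signed-digit tag, modulo 2^n."""
    return sum(t * (1 << i) for i, t in enumerate(tag)) % (1 << n)


def min_weight_bruteforce(delta: int, n: int) -> int:
    """Smallest number of nonzero digits over all n-digit tags with digits in {-1, 0, 1}."""
    return min(
        sum(abs(t) for t in tag)
        for tag in product((-1, 0, 1), repeat=n)
        if tag_value(tag, n) == delta
    )
```

For $\Delta = 3$ and $n = 3$ the function returns `(-1, 0, 1)`, i.e. the tag $\bar{1}01$ of weight 2, which uses link $-2^0$ and link $+2^2$ and has no adjacent nonzero digits.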

Figure 3. Routing from $4^3$ to $1^0$ in the ADM network for $N = 8$. The solid lines are the links on the routing paths and the dotted lines are other links of the ADM network. Labels on the links are digits of the routing tags.

Since the set of tags generated by algorithm SHORTEST-PATH is a subset of those generated by algorithm ALL-TAGS, algorithm SHORTEST-PATH correctly generates a tag of distance $\Delta$. It remains to show that the tag has minimum weight. The strategy used in algorithm SHORTEST-PATH is to generate a zero digit whenever possible (when $\Delta_i$ is odd, $t_i$ is chosen such that $\Delta_{i+1}$ is even, which makes $t_{i+1} = 0$). To see why this is a good strategy, let $i$ be the smallest index such that $\Delta_i$ is odd, and let $t^*$ and $t$ be the solutions found by applying this strategy and by not complying with this strategy, respectively. Assume that, without loss of generality, $(\Delta_i - 1)/2$ is even. It is shown that there are four possible cases and the terminating conditions for each case can be continued by applying the discussion for one of the four cases recursively.

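Case 1

The table below illustrates the discussion for case 1 based on this assumption.

                 $t_i$        $\Delta_{i+1}$        $t_{i+1}$    $\Delta_{i+2}$
    $t'$         1            $(\Delta_i - 1)/2$    0            $(\Delta_i - 1)/4$
    $\hat{t}$    $\bar{1}$    $(\Delta_i + 1)/2$    1            $(\Delta_i - 1)/4$

Since $(\Delta_i - 1)/2$ (= $\Delta_{i+1}$ for $t'$, denoted $\Delta_{i+1}(t')$) is assumed to be even, $t'_{i+1} = 0$. Because $\Delta_{i+1}(\hat{t}) = (\Delta_i + 1)/2$ is odd, there are two possible values, 1 and $\bar{1}$, for $\hat{t}_{i+1}$. If $\hat{t}_{i+1} = 1$, $\Delta_{i+2}(\hat{t}) = \Delta_{i+2}(t')$. The discussion for case 1 terminates here.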

Case 2

The alternative is that $\hat{t}_{i+1} = \bar{1}$, which is illustrated in cases 2, 3 and 4 in the tables below. In case 2, $(\Delta_i - 1)/4$ is assumed to be even.

                 $t_i$        $\Delta_{i+1}$        $t_{i+1}$    $\Delta_{i+2}$
    $t'$         1            $(\Delta_i - 1)/2$    0            $(\Delta_i - 1)/4$
    $\hat{t}$    $\bar{1}$    $(\Delta_i + 1)/2$    $\bar{1}$    $(\Delta_i + 3)/4$

Case 2 terminates here with $\Delta_{i+2}(t')$ being even and $\Delta_{i+2}(\hat{t})$ being odd.

Case 3

In cases 3 and 4, $(\Delta_i - 1)/4$ is assumed to be odd. Case 3 is illustrated in the following table with the assumption that $(\Delta_i + 3)/8$ is even.
To conclude, cases 1 and 3 have the terminating conditions that $\Delta_{i+2}(t') = \Delta_{i+2}(\hat{t})$ and $\Delta_{i+3}(t') = \Delta_{i+3}(\hat{t})$, respectively. The discussion for $\Delta_i(t') = \Delta_i(\hat{t})$, which is the condition where the discussion for all cases begins, can be applied again to these terminating conditions. In cases 2 and 4, the terminating conditions are that $\Delta_{i+2}(t')$ is even and $\Delta_{i+2}(\hat{t})$ is odd, and $\Delta_{i+3}(t')$ is even and $\Delta_{i+3}(\hat{t})$ is odd, respectively. The discussions done for each case for iteration $i+1$ when $\Delta_{i+1}(t')$ is even and $\Delta_{i+1}(\hat{t})$ is odd can be applied again to

them.  Let  denote  the  number  of  noniero  bits  of  t,^.  In  case  1. 

|<  ,/,<.||  =■■  1  and  =  2;  in  case  2,  |(  =  2  and  |f,/,+.|  =  2;  in  case  3, 

I*  ./.  +  il  =  I  lL/,+il  =  2;  in  case  4,  =  2  and  =  2.  Thus  all 

possible cases are exhausted and no $\hat{t}$ yields a tag of smaller weight than $t'$.

(ii) Next the proof of uniqueness for the tag generated by algorithm SHORTEST-PATH is shown; the proof is by contradiction. Suppose there exists another tag $\hat{t}_{0/n-1}$ that also has no adjacent nonzero digits. Let $i$ be the lowest index such that $\hat{t}_i \ne t'_i$; thus $\hat{t}_{0/i-1} = t'_{0/i-1}$ so that $\Delta_i(\hat{t}) = \Delta_i(t')$. There are three possible cases, (a), (b) and (c), for $\hat{t}_i \ne t'_i$. (a) $\hat{t}_i = 1$ and $t'_i = \bar{1}$ (or vice versa); then $\hat{t}_{i+1} = 0 = t'_{i+1}$ since $\hat{t}_i \hat{t}_{i+1} = 0$ and $t'_i t'_{i+1} = 0$. But this is impossible because $\Delta_i(\hat{t}) = \Delta_i(t')$ is odd, so that only one of $\hat{t}_{i+1} = 0$ or $t'_{i+1} = 0$ can hold (Lemma 2.3). A contradiction results. (b) $\hat{t}_i = 1$ and $t'_i = 0$ (or vice versa). Then $\Delta_i(\hat{t})$ is odd and $\Delta_i(t')$ is even. But this is impossible because $\Delta_i(\hat{t}) = \Delta_i(t')$. A contradiction results. (c) $\hat{t}_i = \bar{1}$ and $t'_i = 0$ (or vice versa). The discussion is exactly the same as case (b).

(iii) Existence of the other optimal solution is shown for $t'_{n-1} = 1$. The case that $t'_{n-1} = \bar{1}$ can be treated analogously. If $t'_{n-1} = 1$, then

$$\Delta = \Big(\sum_{i=0}^{n-2} t'_i 2^i + 2^{n-1}\Big) \bmod 2^n = \Big(\sum_{i=0}^{n-2} t'_i 2^i + 2^{n-1} - 2^n\Big) \bmod 2^n = \Big(\sum_{i=0}^{n-2} t'_i 2^i - 2^{n-1}\Big) \bmod 2^n ;$$

so $t''_{n-1}$ can also be $\bar{1}$ and the rest of the digits remain unchanged. □

Actually the proof of Theorem 2.4 has a much stronger implication regarding optimality of a tag than just verifying existence of a minimum weight tag that has no adjacent nonzero digits. It is stated as Corollary 2.5.


Corollary 2.5 A feasible solution to (P) is optimal if it has no adjacent nonzero digits.

Proof: From the process of generating each digit in algorithm SHORTEST-PATH, the feasible solutions with no adjacent nonzero digits are either unique or different only at $t_{n-1}$ (1 or $\bar{1}$). There exists an optimal solution to (P) that has no adjacent nonzero digits. So a feasible solution with no adjacent nonzero digits must also be an optimal solution. □


Corollary 2.5 only guarantees optimality of a tag that has no adjacent nonzero digits; a tag with adjacent nonzero digits may as well be a minimum weight tag. For instance, for $n = 4$, $\Delta = -6$, the tag of distance $\Delta$ can be $t_{0/3} = 0\bar{1}\bar{1}0$ or $t_{0/3} = 0101$; both tags have a minimum weight of two.
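The $n = 4$, $\Delta = -6$ instance can be checked by exhaustive enumeration. The sketch below is an assumption-laden illustration, not part of the original text: the helper names `tags_of_distance` and `weight` are hypothetical, and tags are tuples $(t_0, \dots, t_{n-1})$ with -1 standing for $\bar{1}$.

```python
from itertools import product


def weight(tag):
    """Number of nonzero digits of a signed-digit tag."""
    return sum(1 for d in tag if d != 0)


def tags_of_distance(delta, n):
    """All tags (t_0, ..., t_{n-1}) over {-1, 0, 1} whose value is
    congruent to delta modulo 2**n (brute force, small n only)."""
    size = 1 << n
    return [
        tag
        for tag in product((-1, 0, 1), repeat=n)
        if (sum(d * 2**i for i, d in enumerate(tag)) - delta) % size == 0
    ]


if __name__ == "__main__":
    candidates = tags_of_distance(-6, 4)
    best = min(weight(t) for t in candidates)
    for tag in candidates:
        if weight(tag) == best:
            print(tag, "weight", best)
```

The enumeration lists every minimum weight tag of the distance, so both a tag with adjacent nonzero digits and one without appear side by side.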


Corollary 2.6* The maximum number of links on the shortest path in the PM2I network from any source to any destination is $\lceil n/2 \rceil$; i.e.

$$\max_{0 \le \Delta \le N-1} |t'_{0/n-1}| = \lceil n/2 \rceil ,$$

where $|t'_{0/n-1}|$ is the weight of a minimum weight tag of distance $\Delta$.

Proof: From Theorem 2.4, there exists a minimum weight tag with no adjacent nonzero digits for every distance $\Delta$. The maximum number of nonzero digits of such a minimum weight tag is $\lceil n/2 \rceil$; i.e. the tag consists of alternating 1 and 0 digits. □
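The bound of Corollary 2.6 can be checked numerically for small $n$. The following brute-force sketch is an illustration under stated assumptions (the function name `max_min_weight` is hypothetical): it computes, for every distance modulo $2^n$, the smallest number of nonzero digits over all signed-digit tags, and then takes the maximum over distances.

```python
from itertools import product


def max_min_weight(n):
    """Maximum over all distances (mod 2**n) of the minimum tag weight,
    found by enumerating every tag with digits in {-1, 0, 1}."""
    size = 1 << n
    best = [n + 1] * size  # best[r] = smallest weight seen for residue r
    for tag in product((-1, 0, 1), repeat=n):
        residue = sum(d * 2**i for i, d in enumerate(tag)) % size
        w = sum(1 for d in tag if d != 0)
        if w < best[residue]:
            best[residue] = w
    return max(best)


if __name__ == "__main__":
    for n in range(1, 7):
        # (n + 1) // 2 equals ceil(n / 2) for positive integers n
        print(n, max_min_weight(n), (n + 1) // 2)
```

The enumeration grows as $3^n$, so it is only meant as a sanity check for small networks.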


Algorithm SHORTEST-PATH is capable of finding a minimum weight routing tag for the ADM network, which can be converted to a combination tag for the PM2I network, and also deduces that the number of hops is bounded above by $\lceil n/2 \rceil$. This knowledge can be further used to investigate properties of the ADM network.

* An equivalent result is reported in [HwB84]. We were unable to identify the original reference which first reported that result.

3. Construction of Half Augmented Data Manipulator Networks

Corollary 2.6 indicates that the shortest path between any two nodes in the PM2I network uses at most $\lceil n/2 \rceil$ links, which implies that $\lceil n/2 \rceil$ is the least number of stages needed in a multistage network based on PM2I functions, where any source can be connected to any destination in one pass. Furthermore, from Theorem 2.4 it is possible to infer how such a network can be constructed. For convenience of discussion, assume $n$ to be even hereafter. The path in the ADM network defined by the routing tag that has no adjacent nonzero digits includes only one of the links $-2^{2k+1}$, $-2^{2k}$, $+2^{2k}$, $+2^{2k+1}$ and the straight link, for every $k$, $0 \le k \le (n/2)-1$. This implies that the links of two adjacent stages $2k$ and $2k+1$ in the ADM network can be coalesced into one stage and thus the total number of stages is reduced to $n/2$. The network is called the Half ADM (HADM) network. The HADM network consists of $n/2$ stages ordered from 0 to $(n/2)-1$ from the output side to the input side. An extra column of switches is appended in the input side and is referred to as stage $n/2$. A source is a switch at stage $n/2$ and a destination is a switch at stage 0. Switch $j$ of stage $k+1$ has five output links to switches of stage $k$: $(j+2^{2k+1})$, $(j+2^{2k})$, $j$, $(j-2^{2k})$ and $(j-2^{2k+1})$. An HADM network of size $N = 16$ is shown in Figure 4.

The tag generated by algorithm SHORTEST-PATH can be used as a routing tag in the HADM network. Close examination of the topology of the HADM network reveals that there exists latitude in using tags other than the ones with no adjacent nonzero digits to control routing in the HADM network; i.e. two adjacent digits of a routing tag can both be nonzero. Since, for a given source/destination pair, only one of the links $-2^{2k+1}$, $-2^{2k}$, $+2^{2k}$, $+2^{2k+1}$ and the straight link is used at each stage for routing in the HADM network, as long as the tag satisfies the constraint that $t_{2k} t_{2k+1} = 0$ for $0 \le k \le (n/2)-1$, it is a valid routing tag in the HADM network. There are five possible combinations for such a pair of digits $t_{2k}t_{2k+1}$: 00, 10, $\bar{1}0$, 01 and $0\bar{1}$. If $t_{2k}t_{2k+1} = \bar{1}0$, link $-2^{2k}$ is used; if $t_{2k}t_{2k+1} = 10$, link $+2^{2k}$ is used; if $t_{2k}t_{2k+1} = 00$, the straight link is used; if $t_{2k}t_{2k+1} = 01$, link $+2^{2k+1}$ is used; if $t_{2k}t_{2k+1} = 0\bar{1}$, link $-2^{2k+1}$ is used. The routing tags representing the same distance $\Delta$ in the HADM network are called equivalent routing tags. The multitude of equivalent routing tags suggests that there may exist multiple paths for some source/destination pairs. If a routing tag has no equivalent routing tags, it is unique, and only one routing path exists for the source/destination pair.
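The link selection rule described above can be sketched as a short Python routine. This is a hypothetical illustration: the name `hadm_path`, the list-based tag layout (index i holds $t_i$, -1 stands for $\bar{1}$) and the example source value are assumptions made here. Stages are visited from the input side (stage $n/2$) down to stage 0, so the highest digit pair is consumed first.

```python
def hadm_path(source, tag):
    """Follow a routing tag through an HADM network of size 2**len(tag).

    Returns the switch labels visited, starting at the source (stage n/2)
    and ending at the destination (stage 0).
    """
    n = len(tag)
    if n % 2 != 0:
        raise ValueError("n is assumed to be even")
    size = 1 << n
    path = [source]
    j = source
    for k in reversed(range(n // 2)):
        lo, hi = tag[2 * k], tag[2 * k + 1]
        if lo != 0 and hi != 0:
            # violates the constraint t_2k * t_2k+1 = 0
            raise ValueError("digits %d and %d are both nonzero" % (2 * k, 2 * k + 1))
        # exactly one of: straight, +/-2**(2k), +/-2**(2k+1)
        j = (j + lo * 2 ** (2 * k) + hi * 2 ** (2 * k + 1)) % size
        path.append(j)
    return path


if __name__ == "__main__":
    print(hadm_path(3, [0, 1, 0, 1]))    # tag of distance 10 in a size-16 network
    print(hadm_path(3, [0, -1, -1, 0]))  # tag of distance -6, same destination
```

A tag such as `[1, 1, 0, 0]` is rejected, since it would ask a single stage to use two links at once.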

Recall that algorithm SHORTEST-PATH always generates a zero digit whenever possible. If $\Delta_i$ is even, $t_i$ is uniquely confined to be 0; if $\Delta_i$ is odd (for which $t_i$ can be 1 or $\bar{1}$), then it chooses the value for $t_i$ such that $t_{i+1} = 0$. This constraint can be relaxed for generating equivalent routing tags for the HADM network. For $t_{2k} = 0$ and $\Delta_{2k+1}$ odd, two subsets of equivalent tags can be generated by choosing 1 for $t_{2k+1}$ for one of them and by choosing $\bar{1}$ for $t_{2k+1}$ for the other. That is, if $t_{2k}t_{2k+1} = 01$ or $0\bar{1}$, both 1 and $\bar{1}$ can be considered for $t_{2k+1}$ to form equivalent routing tags, since it is always possible to make $t_{2k+3}$ zero by properly choosing a value for $t_{2k+2}$ (Lemma 2.3) and thus satisfy the constraint $t_{2k+2}t_{2k+3} = 0$. For example, there are two paths from $S = 3$ to $D = 13$ in an HADM network of size $N = 16$, which are specified by the tags $t_{0/3} = 0\bar{1}\bar{1}0$ ($\Delta = -6$) and $t_{0/3} = 0101$ ($\Delta = 10$), respectively. In this example, $t_0t_1$ can be $0\bar{1}$ or 01; in particular, the tag $t_{0/3} = 0\bar{1}\bar{1}0$ is obtained by choosing $t_2 = \bar{1}$ so that $t_3 = 0$.

Figure 4. The HADM network for N = 16.
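Equivalent routing tags for a given source/destination pair can be enumerated directly from the pair constraint. The sketch below is a hypothetical illustration (the names `HADM_PAIRS`, `hadm_tags` and `hadm_path` are assumptions made here); it lists all valid HADM tags for $S = 3$, $D = 13$, $N = 16$ and the distinct switch sequences they select. Note that the tags $0101$ and $010\bar{1}$ differ only in the sign of the $2^{n-1}$ link, and $+2^{n-1} \equiv -2^{n-1} \pmod{N}$, so they select the same switches.

```python
from itertools import product

# The five digit pairs (t_2k, t_2k+1) allowed by t_2k * t_2k+1 = 0.
HADM_PAIRS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def hadm_tags(delta, n):
    """All valid HADM tags (t_0, ..., t_{n-1}) of distance delta mod 2**n."""
    size = 1 << n
    tags = []
    for pairs in product(HADM_PAIRS, repeat=n // 2):
        tag = tuple(d for pair in pairs for d in pair)
        value = sum(d * 2**i for i, d in enumerate(tag))
        if (value - delta) % size == 0:
            tags.append(tag)
    return tags


def hadm_path(source, tag):
    """Switch labels visited from stage n/2 (source) down to stage 0."""
    n = len(tag)
    size = 1 << n
    j, path = source, [source]
    for k in reversed(range(n // 2)):
        j = (j + tag[2 * k] * 2 ** (2 * k) + tag[2 * k + 1] * 2 ** (2 * k + 1)) % size
        path.append(j)
    return tuple(path)


if __name__ == "__main__":
    tags = hadm_tags(13 - 3, 4)
    for tag in tags:
        print(tag, hadm_path(3, tag))
    print(len({hadm_path(3, tag) for tag in tags}), "distinct switch sequences")
```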

Similar to the relationship between the ADM and IADM networks, the Half IADM (HIADM) network has the same topology as the HADM network with the input and output sides exchanged. A tag $t_{0/n-1}$ for routing from $S$ to $D$ in the HADM network can be conveniently converted to $t'_{0/n-1}$, where $t'_i = -t_i$, $0 \le i \le n-1$, for routing from $D$ to $S$ in the Half IADM network. Note that tag $t'_{0/n-1}$ also satisfies the constraint $t'_{2k}t'_{2k+1} = 0$, $0 \le k \le (n/2)-1$.

It was shown that for some source and some destination in the ADM network, there exists only one path between them; the same holds in the HADM network. For example, routing for a distance $\Delta = 0$ in an HADM network of size $N = 16$ has a unique tag $t_{0/3} = 0000$, which represents a path consisting of all straight links. Thus the HADM network is not fault-tolerant. It is interesting to attempt further reduction of the network complexity while maintaining the connection between any source and any destination. It is shown in Theorem 3.1 that actually only four output links for each switch would suffice to provide connection for any source/destination pair in the HADM network.

Consider a quad-tree that consists of $\log_4 N$ levels and $N$ leaves. Clearly the out-degree of four for each node in the quad-tree is the smallest out-degree such that the root can reach any leaf; if any node except a leaf has an out-degree less than four, some leaves cannot be reached from the root. Similarly, for a network of size $N$ that consists of $\log_4 N$ stages of uniform switches, at least four output links for each switch are needed so that any source can communicate with any destination. Such a network has the minimum number of output links for each switch required for one-to-one connections and is called a Minimum ADM (MADM) network. It consists of $n/2$ stages of 4x4 switches. Each switch of stage $k+1$, $0 \le k \le (n/2)-1$, is connected to four output links: the straight link, $+2^{2k}$, $-2^{2k}$ and one of $+2^{2k+1}$ or $-2^{2k+1}$. Figure 5 illustrates a MADM network of size $N = 16$.

The MADM and HADM networks differ only in that each switch of stage $k$ in the MADM network is connected to only one of the $+2^{2k+1}$ and $-2^{2k+1}$ links while each switch of stage $k$ in the HADM network is connected to both links. So only a subset of routing tags for the HADM network are valid routing tags for the MADM network. In addition to the constraint that $t_{2k}t_{2k+1} = 0$ for $0 \le k \le (n/2)-1$, which a routing tag for the HADM network must satisfy, a valid tag for the MADM network must also satisfy the second constraint that, for $\Delta_{2k+1}$ odd, $t_{2k+1}$ must be 1 if link $+2^{2k+1}$ is used and $t_{2k+1}$ must be $\bar{1}$ if link $-2^{2k+1}$ is used. The second constraint does not specify which of links $+2^{2k+1}$ and $-2^{2k+1}$ is used at stage $k$; each stage can choose freely a plus or minus link. As a result, there are as many as $2^{n/2}$ types of MADM network; they differ in their choice of link $+2^{2k+1}$ or $-2^{2k+1}$ at some stage $k$. The algorithm MADM-TAGS below demonstrates an example of generating routing tags for a particular type of MADM network that contains the $+2^{2k+1}$ link at every stage $k$, $0 \le k \le (n/2)-1$. For convenience of discussion, this network is referred to as the MADM network.

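Algorithm MADM-TAGS ($\Delta$, $t_{0/n-1}$)

    $\Delta_0 = \Delta$
    for $i = 0$ to $n-1$ do
        if $\Delta_i$ is even then $t_i = 0$, $\Delta_{i+1} = \Delta_i / 2$
        else if $i$ is even then
                 if $(\Delta_i - 1)/2$ is even then $t_i = 1$, $\Delta_{i+1} = (\Delta_i - 1)/2$
                 else $t_i = \bar{1}$, $\Delta_{i+1} = (\Delta_i + 1)/2$
                 endif
             else $t_i = 1$, $\Delta_{i+1} = (\Delta_i - 1)/2$
             endif
        endif
    enddo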

The difference between the processes of generating tags for the HADM network and for the MADM network is that, for $\Delta_{2k+1}$ odd, $t_{2k+1}$ can be 1 or $\bar{1}$ for the HADM network while $t_{2k+1}$ can only be 1 when generating a routing tag for the MADM network. So each digit is uniquely determined in algorithm MADM-TAGS. This indicates that there exists a unique tag for each distinct $\Delta$, which corresponds to a unique path for each source/destination pair in the MADM network.
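A Python sketch of the tag generation for the MADM network with a $+2^{2k+1}$ link at every stage is given below. It is an illustration under assumptions: the function name `madm_tag` is hypothetical, the tag is a list with index i holding $t_i$, and -1 stands for $\bar{1}$. Odd-indexed digits are restricted to 0 or 1, while even-indexed digits are chosen so that the following digit becomes 0.

```python
def madm_tag(delta, n):
    """Sketch of tag generation for the MADM network (plus link at every stage).

    Returns [t_0, ..., t_{n-1}] with t_{2k} in {-1, 0, 1}, t_{2k+1} in {0, 1}
    and t_{2k} * t_{2k+1} == 0, representing delta modulo 2**n.
    """
    tag = []
    d = delta  # plays the role of Delta_i
    for i in range(n):
        if d % 2 == 0:
            t = 0
        elif i % 2 == 1:
            t = 1   # only the +2**i link exists for odd i
        elif ((d - 1) // 2) % 2 == 0:
            t = 1   # makes Delta_{i+1} even, hence t_{i+1} = 0
        else:
            t = -1  # makes Delta_{i+1} even, hence t_{i+1} = 0
        tag.append(t)
        d = (d - t) // 2
    return tag


if __name__ == "__main__":
    for delta in (9, 5, 1, -3):
        print(delta, madm_tag(delta, 4))
```

Since every branch is forced by the parity of $\Delta_i$ and of $i$, no choice is left open, which is the uniqueness property stated above.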

Since  there  are  only  four  output  links  for  each  switch  in  the  MADM  net¬ 
work,  two  bits  per  stage  suffice  to  represent  the  choice  of  one  of  the  four  output 
links  of  a  switch  to  send  data.  A  total  of  n  bits  are  needed  to  implement  the 
signed-digit representations for routing tags. Let $r_{0/n-1}$ be such a routing tag, in which a digit can be represented by a bit. Each switch at stage $k$ in the MADM network examines bits $r_{2k}r_{2k+1}$ to determine the output link via which data are routed. One possible implementation is shown below.

    $r_{2k}r_{2k+1}$ → one of the links $+2^{2k+1}$, straight, $+2^{2k}$, $-2^{2k}$, for $0 \le k \le (n/2)-1$

where → means "en route".

However, for the generation of tags in algorithm MADM-TAGS, two bits may be needed to represent a digit of the routing tag and thus a total of $2n$ bits are needed. Once the computation is done, the tag can be converted to $r_{0/n-1}$ for actual routing, which requires only $n$ bits per tag.

Theorem 3.1 There exists a unique path between any source and any destination in the MADM network.

Proof: It is shown that a routing tag $t_{0/n-1}$ for the HADM network that contains $t_{2k}t_{2k+1} = 0\bar{1}$ can be recoded to become $t'_{0/n-1}$ such that $t'_{2k}t'_{2k+1} = 01$. Case (i) If $t_{2k+2} = 0$ such that $t_{2k/2k+2} = 0\bar{1}0$, then $t'_{2k/2k+3} = 01\bar{1}t'_{2k+3}$ or $011t'_{2k+3}$. Since $\Delta_{2k+2}$ is odd, from Lemma 2.3, either $01\bar{1}t'_{2k+3}$ or $011t'_{2k+3}$ has $t'_{2k+3} = 0$. Case (ii) If $t_{2k+2} \ne 0$ then $t_{2k+3}$ must be equal to 0, because a tag for the HADM network must satisfy the constraint $t_{2k+2}t_{2k+3} = 0$. If $t_{2k/2k+3} = 0\bar{1}10$, then $t'_{2k/2k+3} = 0100$, and if $t_{2k/2k+3} = 0\bar{1}\bar{1}0$ then $t'_{2k/2k+3} = 010\bar{1}$. The discussion for recoding $t_{2k+2}t_{2k+3} = 0\bar{1}$ is analogous to that for recoding $t_{2k}t_{2k+1} = 0\bar{1}$. Next, uniqueness of the routing path is shown. Since the out-degree of every switch in the MADM network is four and there are $n/2 = (\log_2 N)/2 = \log_4 N$ stages, each source switch and all switches connected to it form a quad-tree. The source switch is the root and the switches connected to it are the nodes in the quad-tree, with the switches of stage 0 as the leaves. There exists a unique path from a root to a leaf in the quad-tree and thus also a unique path from a source to a destination in the MADM network. □
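The quad-tree argument can be illustrated by counting, for every distance, how many digit-pair sequences are allowed in the MADM network with a $+2^{2k+1}$ link at every stage. The sketch below is a hypothetical check (names `MADM_PAIRS` and `madm_tag_counts` are assumptions made here), intended for small $n$ only.

```python
from collections import Counter
from itertools import product

# Digit pairs (t_2k, t_2k+1) available at each MADM stage:
# straight, +2**(2k), -2**(2k), +2**(2k+1).
MADM_PAIRS = [(0, 0), (1, 0), (-1, 0), (0, 1)]


def madm_tag_counts(n):
    """Count, for each distance modulo 2**n, the number of valid MADM tags."""
    size = 1 << n
    counts = Counter()
    for pairs in product(MADM_PAIRS, repeat=n // 2):
        value = sum(
            lo * 2 ** (2 * k) + hi * 2 ** (2 * k + 1)
            for k, (lo, hi) in enumerate(pairs)
        )
        counts[value % size] += 1
    return counts


if __name__ == "__main__":
    counts = madm_tag_counts(4)
    print(len(counts), "distances reached;",
          "each exactly once" if set(counts.values()) == {1} else "with repeats")
```

With four links per stage and $n/2$ stages there are exactly $4^{n/2} = N$ tags, so reaching all $N$ distances forces each distance to be reached exactly once.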

The topology of the Minimum Inverse ADM (MIADM) network is the same as the MADM network, with the input and output sides reversed, much like the relationship between the ADM and IADM networks and between the HADM and HIADM networks. In particular, the routing tag conversion technique used for the HADM and HIADM networks can be readily applied, and the proposed routing scheme for the HADM network can also be used in the HIADM network.
4. The Extra Stage MADM Network

Complexity of the MADM network is minimum in the sense that, given the constraint of network size $N$ and $\log_4 N$ stages of uniform switches, it can provide communication for any source/destination pair in the network by using the minimum number of interstage links per stage. However, this kind of topology has the drawback that it does not provide fault-tolerance; a switch failure would prevent some source/destination pairs from communicating with each other. The lack of fault-tolerance suggests the use of augmentation techniques [AdSSl] to improve fault-tolerance for the MADM network. First an important observation for routing in the MADM network is made.

Theorem 4.1 In the MADM network, the paths from a source $S$ to destinations $D$, $(D+(N/4))$, $(D-(N/2))$ and $(D-(N/4))$ are all disjoint.

Proof: The proof shows only that the two paths from $S$ to $D$ and from $S$ to $(D-(N/4))$ are disjoint; the other cases can be treated similarly. The proof consists of two parts: (A) given the tag $t_{0/n-1}$ for routing from $S$ to $D$, a tag $t'_{0/n-1}$ for routing from $S$ to $(D-(N/4))$ can be derived from it and they differ only in digits $n-2$ and $n-1$, and (B) proof of disjointness of the two paths based on the results in (A).

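(A) Since $t_{0/n-1}$ is the routing tag from $S$ to $D$,

$$(D - S) = \Big(\sum_{i=0}^{n-3} t_i 2^i + t_{n-2}2^{n-2} + t_{n-1}2^{n-1}\Big) \bmod 2^n .$$

So

$$(D - S - (N/4)) = \Big(\sum_{i=0}^{n-3} t_i 2^i + (t_{n-2} - 1)2^{n-2} + t_{n-1}2^{n-1}\Big) \bmod 2^n .$$

There are three possible values, 1, $\bar{1}$ and 0, for $t_{n-2}$, which are discussed in cases (i), (ii) and (iii), respectively, as follows.

(i) If $t_{n-2} = 1$, $(D - S - (N/4)) = (\sum_{i=0}^{n-3} t_i 2^i + 0 \cdot 2^{n-2} + t_{n-1}2^{n-1}) \bmod 2^n$. That is, $t'_{n-2} = 0$ and $t'_{n-1} = t_{n-1}$.

(ii) If $t_{n-2} = \bar{1}$, $t_{n-1}$ must be 0 because $t_{n-2}t_{n-1} = 0$, and $(D - S - (N/4)) = (\sum_{i=0}^{n-3} t_i 2^i + 0 \cdot 2^{n-2} + (t_{n-1} - 1)2^{n-1}) \bmod 2^n$. Then

$$(D - S - (N/4)) = \Big(\sum_{i=0}^{n-3} t_i 2^i + 0 \cdot 2^{n-2} - 2^{n-1}\Big) \bmod 2^n = \Big(\sum_{i=0}^{n-3} t_i 2^i + 0 \cdot 2^{n-2} - 2^{n-1} + 2^n\Big) \bmod 2^n = \Big(\sum_{i=0}^{n-3} t_i 2^i + 0 \cdot 2^{n-2} + 2^{n-1}\Big) \bmod 2^n .$$

So $t'_{n-2}t'_{n-1} = 01$.

(iii) If $t_{n-2} = 0$, $(D - S - (N/4)) = (\sum_{i=0}^{n-3} t_i 2^i - 2^{n-2} + t_{n-1}2^{n-1}) \bmod 2^n$. There are two possible values, 0 and 1, for $t_{n-1}$, which are discussed in cases (a) and (b), respectively. $t_{n-1}$ cannot be $\bar{1}$ because it is assumed that link $+2^{n-1}$ is used at stage $n/2$ in the MADM network. (a) If $t_{n-1} = 0$, $t'_{0/n-1} = t_{0/n-3}\bar{1}0$. (b) If $t_{n-1} = 1$,

$$(D - S - (N/4)) = \Big(\sum_{i=0}^{n-3} t_i 2^i - 2^{n-2} + 2^{n-1}\Big) \bmod 2^n = \Big(\sum_{i=0}^{n-3} t_i 2^i + 2^{n-2} + 0 \cdot 2^{n-1}\Big) \bmod 2^n$$

so that $t'_{0/n-1} = t_{0/n-3}10$.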

(B) From (A), it is seen that the two routing tags for the two paths from $S$ to $D$ and from $S$ to $(D-(N/4))$ differ only in digits $n-2$ and $n-1$; i.e. $t_i = t'_i$ for $0 \le i \le n-3$. The two tags are the unique tags for routing from $S$ to $D$ and from $S$ to $(D-(N/4))$, respectively. Let $F$ and $F'$ be the two switches at stage $(n/2)-1$ on the paths from $S$ to $D$ and from $S$ to $(D-(N/4))$, respectively. Since $t_i = t'_i$ for $0 \le i \le n-3$ (i.e. the distances that the two paths traverse from stage $(n/2)-1$ to 0 are the same), and the distance between the two destinations $D$ and $(D-(N/4))$ is $((N/4) \bmod N)$, the distance between $F$ and $F'$ must also be $((N/4) \bmod N)$, denoted $|F - F'| = (N/4) \bmod N$. The intermediary switches at stage $k$, $0 \le k \le (n/2)-2$, on the two paths are $F + \delta_k$ and $F' + \delta_k$, respectively, where

$$\delta_k = \sum_{j=k}^{(n/2)-2} \big(t_{2j}2^{2j} + t_{2j+1}2^{2j+1}\big) .$$

But $(F + \delta_k) \ne (F' + \delta_k) \bmod 2^n$ because $|F - F'| = (N/4) \bmod N$. That is, the two paths never share a common intermediary switch and thus are disjoint. □
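The disjointness claim of Theorem 4.1 can be checked exhaustively for small networks. The sketch below is illustrative only: `madm_tag`, `madm_path` and `paths_disjoint` are hypothetical helper names, and the MADM network is assumed to contain the $+2^{2k+1}$ link at every stage, as in the discussion above.

```python
def madm_tag(delta, n):
    """Unique MADM tag [t_0, ..., t_{n-1}] of distance delta mod 2**n."""
    tag, d = [], delta
    for i in range(n):
        if d % 2 == 0:
            t = 0
        elif i % 2 == 1 or ((d - 1) // 2) % 2 == 0:
            t = 1
        else:
            t = -1
        tag.append(t)
        d = (d - t) // 2
    return tag


def madm_path(source, dest, n):
    """Switch labels from stage n/2 (source) down to stage 0 (dest)."""
    size = 1 << n
    tag = madm_tag((dest - source) % size, n)
    j, path = source, [source]
    for k in reversed(range(n // 2)):
        j = (j + tag[2 * k] * 2 ** (2 * k) + tag[2 * k + 1] * 2 ** (2 * k + 1)) % size
        path.append(j)
    return path


def paths_disjoint(source, dest, n):
    """True if the paths to D, D+N/4, D-N/2, D-N/4 share no switch
    at any stage below the common source stage."""
    size = 1 << n
    targets = [dest, dest + size // 4, dest - size // 2, dest - size // 4]
    paths = [madm_path(source, t % size, n) for t in targets]
    return all(len({p[s] for p in paths}) == 4 for s in range(1, n // 2 + 1))


if __name__ == "__main__":
    n = 4
    ok = all(paths_disjoint(s, d, n) for s in range(1 << n) for d in range(1 << n))
    print("all four-path families disjoint for N = 16:", ok)
```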

The identification of disjoint paths from a source to different destinations in Theorem 4.1 can be used to improve fault-tolerance for the MADM network. The technique is to add an extra stage to the MADM network. The extra stage can be placed in the output side of the MADM network such that each switch $D$ at the extra stage is connected to four switches at the first stage of the MADM network: $D$, $(D+(N/4))$, $(D-(N/2))$ and $(D-(N/4))$. Data can be sent from source $S$ to any of the four switches and then to the destination via the extra stage. Thus there exist four disjoint paths from any source to any destination in the extra stage network. Such a network with an extra stage in the output side of the MADM network is called an extra stage MADM network. An extra stage MADM network consists of $(n/2)+1$ stages labeled from 0 to $n/2$ from the output side to the input side, with an additional column of switches in the input side referred to as stage $(n/2)+1$. The extra stage in the extra stage MADM network consists of the switches of stage 0 and the input links of the switches. The topology of the extra stage MADM network from stage 1 to $(n/2)+1$ is the same as that of the MADM network from stage 0 to $n/2$. The extra stage MADM network is three-fault-tolerant because of the existence of four disjoint paths for every source/destination pair and thus can withstand at least three switch failures (except the input and output switches). Since each destination in the MADM network has four input links, which are connected to four switches in the preceding stage, at most three-fault-tolerance is possible. By appending an extra stage to the MADM network, the optimal fault tolerance is achieved.

Since the four output links of a switch at the extra stage are the straight link and links $-2^{n-2}$ (= $-N/4$), $+2^{n-2}$ (= $N/4$) and $-2^{n-1}$ (= $-N/2$), the extra stage


$(n/2)+1$ to stage 1 in the extra stage MADM network. Using the tags of distances $\Delta = (D-S)$, $\Delta = (D-S-(N/4))$, $\Delta = (D-S-(N/2))$, and $\Delta = (D-S+(N/4))$, respectively, a source $S$ (a switch at stage $(n/2)+1$) in the extra stage MADM network can send data to any of the four switches $D$, $(D-(N/4))$, $(D-(N/2))$ and $(D+(N/4))$ at stage 1, and then reaches the destination $D$ at stage 0. The routing from $D^1$ to $D^0$ is controlled by tag bits 00, from $(D+(N/4))^1$ to $D^0$ by $\bar{1}0$, from $(D-(N/2))^1$ to $D^0$ by 01, and from $(D-(N/4))^1$ to $D^0$ by 10, where the superscript denotes the stage. So in the extra stage MADM network, $n+2$ bits are needed to represent a routing tag. Note that since the four tags of distances $\Delta = (D-S)$, $\Delta = (D-S-(N/4))$, $\Delta = (D-S-(N/2))$, and $\Delta = (D-S+(N/4))$ differ only in digits $n-2$ and $n-1$, once one of them is computed, the others can be readily computed by recoding the last two digits. The proof of Theorem 4.1 demonstrates the example of recoding the tag of distance $\Delta = (D-S)$ to a tag of distance $\Delta = (D-S-(N/4))$. The table below summarizes the recoding of digits $t_{n-2}t_{n-1}$ of a tag into the other three tags that are of distance $+N/4$, $-N/4$ and $-N/2$ from it.


    $t_{n-2}t_{n-1}$    $+N/4$        $-N/4$        $-N/2$
    00                  10            $\bar{1}0$    01
    01                  $\bar{1}0$    10            00
    10                  01            00            $\bar{1}0$
    $\bar{1}0$          00            01            10

Figure 6 illustrates an extra stage MADM network of size $N = 16$. Also shown are the four disjoint paths from $S = 3$ to $D = 12$. They are represented by the tags of distances $\Delta = 9$, $\Delta = 5$, $\Delta = 1$ and $\Delta = -3$, which are ($t_{0/3}$ =) 1001, 1010, 1000 and $10\bar{1}0$, respectively. The routing paths are $(3^3, 11^2, 12^1, 12^0)$, $(3^3, 7^2, 8^1, 12^0)$, $(3^3, 3^2, 4^1, 12^0)$ and $(3^3, 15^2, 0^1, 12^0)$, respectively, where the superscript denotes the stage. Routing from $12^1$ to $12^0$ is controlled by tag bits 00, from $8^1$ to $12^0$ by 10, from $4^1$ to $12^0$ by 01, and from $0^1$ to $12^0$ by $\bar{1}0$.
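The four paths of this example can be reproduced with a short sketch. The names `madm_tag`, `madm_path` and `extra_stage_paths` are hypothetical, and the MADM portion is assumed to contain the $+2^{2k+1}$ link at every stage. Each path is routed through the MADM portion to one of the four relay switches $D$, $D-(N/4)$, $D-(N/2)$ and $D+(N/4)$, and then crosses the extra stage to $D$.

```python
def madm_tag(delta, n):
    """Unique MADM tag [t_0, ..., t_{n-1}] of distance delta mod 2**n."""
    tag, d = [], delta
    for i in range(n):
        if d % 2 == 0:
            t = 0
        elif i % 2 == 1 or ((d - 1) // 2) % 2 == 0:
            t = 1
        else:
            t = -1
        tag.append(t)
        d = (d - t) // 2
    return tag


def madm_path(source, dest, n):
    """Switch labels visited inside the MADM portion, source first."""
    size = 1 << n
    tag = madm_tag((dest - source) % size, n)
    j, path = source, [source]
    for k in reversed(range(n // 2)):
        j = (j + tag[2 * k] * 2 ** (2 * k) + tag[2 * k + 1] * 2 ** (2 * k + 1)) % size
        path.append(j)
    return path


def extra_stage_paths(source, dest, n):
    """Four paths from source to dest in the extra stage MADM network."""
    size = 1 << n
    quarter = size // 4
    relays = [dest, dest - quarter, dest - 2 * quarter, dest + quarter]
    # the last hop crosses the extra stage from the relay switch to dest
    return [madm_path(source, r % size, n) + [dest] for r in relays]


if __name__ == "__main__":
    for path in extra_stage_paths(3, 12, 4):
        print(path)
```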

It can be similarly shown that an extra stage can also be appended in the input side of the MADM network such that a switch $S$ at the extra stage is connected to four switches at stage $n/2$ of the MADM network: $S$, $(S+1)$, $(S-1)$ and $(S+2)$. Four disjoint paths result from addition of such an extra stage to the MADM network. In this type of extra stage network, the extra stage consists of the switches of stage $(n/2)+1$ and the output links of the switches, and stage $n/2$ to stage 0 has the same topology as the MADM network. The extra stage appended in the input side has the same connection patterns as stage 1 of the MADM network. A source $S$ at the extra stage can send data to any of the four switches at stage $n/2$: $S$, $(S+1)$, $(S-1)$ and $(S+2)$ that are directly connected to it and uses tags of distances $\Delta = (D-S)$, $\Delta = (D-1-S)$, $\Delta = (D+1-S)$ and $\Delta = (D-2-S)$, respectively, to send data to the destination $D$ at stage 0. The routing from $S^{(n/2)+1}$ to $S^{n/2}$ is controlled by tag bits 00, from $S^{(n/2)+1}$ to $(S+1)^{n/2}$ by 10, from $S^{(n/2)+1}$ to $(S-1)^{n/2}$ by $\bar{1}0$, and from $S^{(n/2)+1}$ to $(S+2)^{n/2}$ by 01.

Apparently adding an extra stage to the input side of an MADM network is equivalent to adding the extra stage to the output side of the MIADM network and vice versa. Thus all discussions associated with the relationship between the MADM and MIADM networks can be applied to the extra stage networks as well.

5.  Conclusion 

This paper addresses the problem of designing multistage networks which are based on the implementation of PM2I functions at each stage. This type of multistage network is referred to as augmented data manipulator networks and includes the well known ADM and IADM networks. Since the designs proposed in this paper use fewer stages and links than the ADM and IADM networks, they are referred to as partially augmented data manipulator networks. The HADM and HIADM networks derived in this paper have the least number of stages required in multistage networks (based on PM2I functions) where any source can be connected to any destination in one pass. The MADM and MIADM networks also have the least number of stages and, in addition, have the minimum number of links per switch required for one-to-one connections. The extra stage MADM and MIADM networks contain one more stage than the MADM and MIADM networks, respectively, and are fault-tolerant versions of the MADM network capable of tolerating at least three switch faults.
This paper proposes a systematic approach to the design of multistage interconnection networks which are controlled by destination tag routing. Both unique-path and fault-tolerant networks are considered. The approach is based on the observation that such networks can be derived by coalescing multiple trees. Several seemingly unrelated or independent designs are found to be governed by this approach; this finding leads to more efficient routing and better fault-tolerant topologies for them.

Dnritig  the  oast  decaue.  .a  pleiliora  ■,(  intorconnect  ion  m",'. corks  have  bi.'cii  firoposed  I’etuajC'  tiie  ;  'i;i  i .- 

of  !nulti.‘'t.age  networks  is  .dial  of  multistage  -lelworks  siien  ;is  the  Imlirect  liiuary  n-i.'ube  ' 

Daseline'’,  .Modilled  ADM*'.  Hi,>‘  rimi  a  ■^peeiai  .-ace  -u'  aW-I>,any:ui'’  nef.vorks.  .\mong  tlie  '.di  i  ..r  ■  ; 

’.hese  networks  are  their  very  ('.dicii.i.t.  liestir;.-'. '. i'vn  t.ag  .'fuiting  sriierncs,  partit ninahii'.ty,  Ol.Vi.jgo'.  '"U  :i;d 
pass  useful  pernnU.atious. 

In a cube-type network, the binary representation of the destination address can be used directly as a routing tag. The interchange box at stage $i$ needs only to examine the $i$-th bit of the destination routing tag of an incoming message. If the $i$-th bit is 0, the upper output of the box is taken and if the $i$-th bit is 1, the lower output of the box is taken. These schemes are known as destination tag routing schemes and are extremely efficient and simple to implement.

Due to the requirement for efficient routing of data in interconnection networks used in real-time applications, a
destination tag routing scheme is a desirable feature in interconnection networks. It is interesting to determine whether or not

there exist network topologies that are not necessarily equivalent to the cube-type networks and can be controlled by
destination tags. The motivation of this paper is to find a systematic approach to construct networks of this type,
referred to as destination-tag-controlled networks (DN's).

Let any input switch in the first stage of a network be called a source, and an output switch in the last stage of a
network be called a destination. Input links of a source and output links of a destination are called terminal links.
These networks have the property that a unique routing path exists between any source and any destination; as a
consequence, a single switch failure eliminates the possibility of communication between some source/destination pairs. To
overcome this problem, a popular approach is to provide extra interstage links to an existing unique-path topology in order
to create additional paths and thus provide an ability to tolerate faults. In fact, many networks with redundant paths
can be regarded as unique-path networks to which extra links have been added. For example, it has been shown that
the Gamma network can be regarded as a fault-tolerant version of the ICube network where one extra link is added to
every switch of the ICube network. An ICube network is shown in Figure 1 and a Gamma network and its relationship
with the ICube network is shown in Figure 2. The Gamma network has three output links for each switch and
[...] source/destination pairs. However, the Gamma network may not be able to survive some
instances of a single switch failure; i.e., the communication for some source/destination pairs may be eliminated due to a
switch failure. To achieve one-fault-tolerance for the Gamma network, yet another extra link is added to every switch of
the Gamma network, which results in the Kappa network. This is equivalent to adding two extra links to every switch
of the ICube network, and the Kappa network can tolerate at least one switch failure. A Kappa network and its relationship
with the Gamma network is shown in Figure 3.

This paper is concerned with the construction of DN's that have multiple paths for each source/destination pair,
called fault-tolerant DN's (FDN's), as well as unique-path DN's (UDN's). In particular, it unifies the principles that
underlie the construction of the ICube, Gamma and Kappa networks and shows that a plethora of other DN topologies
can also result from adding extra links for each switch of the ICube and Gamma networks, respectively.

Notation used in this paper is defined next. Any integer t has a binary representation $t_{n-1} t_{n-2} \cdots t_0$,
where $t_{n-1}$ is the most significant bit and n denotes the number of bits. The notation $t_{p/q}$ denotes the bits of t starting
at $t_p$ and ending at $t_q$. To indicate the 1's complement of bit $u_i$, the notation $\bar{u}_i$ is used. Throughout this paper, u and
u+a (where a is some constant) represent labels of switches. Also modulo N arithmetic is assumed, e.g. u+a implies
(u+a) mod N. The notation $u^i$ is used to indicate that a switch u belongs to stage i and $(u^i, v^{i+1})$ is used to represent
a link at stage i joining $u^i$ and $v^{i+1}$. The out-degree of a switch is the number of output links of the switch and the
in-degree, the number of input links.

Section 2 of the paper examines the structure of the cube-type networks and proposes binary tree structures for
constructing DN's. D-constructs are presented in Section 3 as building blocks of binary tree structures and UDN's are
constructed by using the D-constructs. Because UDN's are not fault-tolerant, improved topologies have been proposed in the
past which, by adding one extra link per switch, provide some redundant paths. These and other similar networks are
referred to as enhanced DN's (EDN's) and are considered in Section 4. However, EDN's may still fail to tolerate a single
fault in a switch or link. In Section 5 fault-tolerant D-constructs are proposed to construct FDN's that are capable of
tolerating at least one switch failure. Merits in routing and rerouting schemes and fault-tolerance advantages are also
discussed for UDN's, EDN's and FDN's, respectively. Section 6 concludes the paper.

2. THE BINARY AND FAULT-TOLERANT TREE STRUCTURES

This section examines the structure of the cube-type networks, which are known UDN's, in order to derive and
provide insights into a systematic approach to construct DN's.

The structures of all cube-type networks consist of $n = \log_2 N$ stages of $N/2$ $2 \times 2$ switches. Because there are n
stages of switches and each switch is connected to two switches at the next stage, for each source there exists a binary
tree that contains the source as the root of the tree and the switches reachable from the root as the nodes of the tree.
A switch is said to be reachable from another if there exists a path between them. The $N = 2^n$ output switches at the
last stage are the leaves shared by all binary trees. For example, Figure 1 shows a binary tree in an ICube network. It
includes the source 00 as the root, switches 00 and 01 of stage 1, and all switches of the last stage as the nodes of the
binary tree. The ICube network can be regarded as the coalescing or partial overlapping of N binary trees, each of which
has a distinct root (source) and may or may not share with other trees switches of stage i, $0 < i < n-1$; for $i = n-1$,
the N destinations are the leaves shared by all N binary trees.

The binary tree structure was used to assign labels to terminal links. The label can be obtained by assigning a
label 0 for the upper output link and 1 for the lower output link and concatenating the labels along the path from a
terminal input link to a terminal output link. This is also illustrated in Figure 1.

Notice that in the ICube network a source has two terminal input links and a destination has two terminal output
links. In the ICube network, any arbitrary terminal input link can be connected to any arbitrary terminal output link
because there exists a path that connects the switch (source) of the terminal input link and the switch (destination) of
the terminal output link. As far as one-to-one routing is concerned, only the interstage connection patterns affect the
connection between a source and a destination, and the number of terminal links of the source and the destination is
unimportant. This allows this work to concentrate on the study of the interstage connection patterns and simplifies the
discussions for the construction of DN topologies.

The above observations suggest the following sufficient conditions for a network of size N×N to be a DN: (a) there
exist at least N embedded binary trees and at least one such tree is rooted at each input switch of the network and has
the N output switches as its leaves, and (b) each of the two output links of a node of the binary tree is assigned a label 0
or 1 and the unique address of any destination of the network is formed by concatenating the labels along the path from
the root to the destination. The DN may have a different number of switches at each stage, switches of different stages do
not necessarily have the same size, and even the switches of the same stage can be of varied sizes. A binary tree has
log N levels and thus the DN has log N stages (the basic idea is also applicable to networks with an arbitrary number of
stages, but this case is out of the scope of this paper).

The notion of block structure was used to describe the topologies of the cube-type networks and can be regarded as
a subset of the binary tree structures. The concept of the block structure can be best explained by the illustration of the
structure of a Baseline network in Figure 4. The entire network is regarded as an N×N block and then it is divided into
a stage and two subblocks. The stage is the first stage of the block and the remaining stages are divided into two
subblocks. The process is repeated until the last stage is reached, where a block consists of a single switch. Because a switch
of the first stage of a block is connected to two switches of the first stage of the two subblocks, an input switch and all
switches reachable from it constitute a binary tree. In other words, the block structure can be formed by properly
coalescing binary trees. However, many possible structures which are not necessarily a block structure can result from
coalescing multiple binary trees. That is, the block structure employs more restricted connections for constructing
networks than those resulting from coalescing multiple binary trees.

One of the important criteria in choosing a switch size is network modularity. A module is a building block in a
network that can be formed by cascading these building blocks. In order to reduce design and manufacturing costs, it is
desirable that a network can be modularly constructed and the number and size of modules should be minimized. A
uniform switch size for the network (and thus the same number of switches for every stage) can facilitate modular design.
For this reason, only DN's of size N×N that consist of uniform switches are considered in this paper.


If the number of output links of every switch of a DN is two, then there exists a unique path in the DN for every
source/destination pair because there exists a unique binary tree of which the source is the root and there exists a unique
path from a root to any leaf in the tree. Such a DN is a UDN. The inability of the UDN to tolerate faults [...]

coalescing N such fault-tolerant trees. Because modular designs are sought, only DN's with [...]
are considered. Merits and advantages of these designs with respect to fault-tolerance and [...]
are also discussed in later sections.

3. THE D-CONSTRUCT

The interstage connection patterns of a DN are represented by labeling of intermediary [...]
extremes of a DN are the source and destination which have addresses [...] and [...]

relayed via intermediary switches. Labeling of these switches reflects routing as a progressive [...]
source address bits to a string of all destination address bits. With this in mind, this section [...]
a class of UDN's based on binary tree structure.


Let the sequence $(k_0, k_1, \ldots, k_{n-1})$ be a permutation of the sequence $(0, 1, \ldots, n-1)$ [...]
$i = 0, 1, \ldots, n-1$. Define a connection function $D^i$ mapping a switch $u^i$ to a switch $D^i(u^i, t_{k_i})$ [...]:

$$D^i(u^i, t_{k_i}) = \begin{cases} p^{i+1} & \text{for } t_{k_i} = 0 \\ q^{i+1} & \text{for } t_{k_i} = 1 \end{cases}$$

where $p_{k_j} = q_{k_j} = u_{k_j}$, $0 \le j \le i-1$, and $p_{k_i} = 0$, $q_{k_i} = 1$,
and the values for $p_{k_j}$ and $q_{k_j}$, $i+1 \le j \le n-1$, are determined according to $u^i$ and $t_{k_i}$ and some design criteria. For
example, if $k_i = i$, $0 \le i \le n-1$, and $p_{k_j} = q_{k_j} = u_{k_j}$, $i+1 \le j \le n-1$, the mapping function $D^i$ is:

$$D^i(u^i, t_i) = \begin{cases} u_{n-1/i+1}\, 0\, u_{i-1/0} & \text{for } t_i = 0 \\ u_{n-1/i+1}\, 1\, u_{i-1/0} & \text{for } t_i = 1 \end{cases}$$

As a more complicated example, if $k_i = n-i-1$, $0 \le i \le n-1$, and $p_{k_j} = q_{k_j} = u_{k_j}$, $i+1 \le j \le n-1$, then the mapping
function $D^i$ is:

$$D^i(u^i, t_{n-i-1}) = \begin{cases} u_{n-1/n-i}\, 0\, u_{n-i-2/0} & \text{for } t_{n-i-1} = 0 \\ u_{n-1/n-i}\, 1\, u_{n-i-2/0} & \text{for } t_{n-i-1} = 1 \end{cases}$$

These two examples correspond to the mapping functions for the ICube and the modified ADM networks, respectively, as
explained later in this section.


For a pair of switches joined by a link, the switch at the lower-order stage is called a predecessor and the switch at
the higher-order stage is called a successor. The definition of $D^i$ indicates that, for each switch $u^i$, there exist two
successors at stage i+1, $p^{i+1}$ and $q^{i+1}$, and labels of $u^i$, $p^{i+1}$ and $q^{i+1}$ agree in bits $k_0, k_1, \ldots,$ and $k_{i-1}$. The two output
links $(u^i, p^{i+1})$ and $(u^i, q^{i+1})$ are called a 0-link and a 1-link, respectively, and switches $p^{i+1}$ and $q^{i+1}$ are called a
0-successor and a 1-successor, respectively. This naming reflects the fact that the $k_i$-th bit of the label of the switch at
stage i+1 connected to $u^i$ is 0 (or 1) if messages are sent via a 0-link (or 1-link). Note that a mapping function $D^i$
defines only the connections between a switch of stage i and two switches of stage i+1, and switches of the same
stage may have different mapping functions.

The set of functions $\{D^i, 0 \le i \le n-1\}$, henceforth called the D-construct, specifies connection patterns between
adjacent stages i and i+1 and therefore can be used as building blocks to construct networks. The definition of the
D-construct specifies the number of successors for every switch to be two, and the constraint of network modularity
determines a uniform switch size for the network. Therefore, a switch of the network has exactly two input and two output
links and all stages have the same number of switches. Routing in the network is such that a switch at stage i needs to
examine the $k_i$-th bit of the routing tag to determine through which output link messages are sent. If the bit is 0, the
message is sent via the 0-link and if it is 1, the message is sent via the 1-link. An interesting relationship between any
routing tag and the labels of the switches on the routing path specified by the routing tag is stated next.

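Property 3.1 In a network constructed by using the D-construct, let $d_{n-1/0}$ be a routing tag and $u^i$ be a switch on the
routing path specified by $d_{n-1/0}$; then $u_{k_j} = d_{k_j}$, $0 \le j \le i-1$.

Proof: Let $s_{n-1/0}$ be the source of the path specified by $d_{n-1/0}$. By definition of the D-construct,
$(D^0(s, d_{k_0}))_{k_0} = d_{k_0}$, $(D^1(D^0(s, d_{k_0}), d_{k_1}))_{k_1} = d_{k_1}$, ..., and
$(D^{i-1}(\cdots D^0(s, d_{k_0}) \cdots, d_{k_{i-1}}))_{k_j} = u_{k_j} = d_{k_j}$ for $0 \le j \le i-1$. □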

Property 3.1 indicates that bits $k_0, k_1, \ldots, k_i$ of the labels of the switches at stage i+1 of the routing path specified
by a routing tag are equal to the corresponding bits of the routing tag. By specifying the routing tag $t_{n-1/0} = d_{n-1/0}$, the
label of a switch $u^i$, $0 \le i \le n-1$, on the routing path has $u_{k_j} = d_{k_j}$, $0 \le j \le i-1$, and at the last stage
(i+1 = n) the destination of the path has the address equal to $d_{n-1/0}$. This is true regardless of the address of the
source sending the message. Thus a network constructed by using the D-construct is a DN. In addition, because the
out-degree of every switch of the DN is two, there exists a unique path from any source to any destination; so the
network is also a UDN.
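As a hedged illustration of Property 3.1, the following Python sketch routes a message through a network built from the simplest D-construct, in which a successor differs from its predecessor only in bit $k_i$. The helper names (`replace_bit`, `route`, `check_property`) are hypothetical and not taken from the paper; the two bit orders tested are the ones given in the text for the ICube ($k_i = i$) and modified ADM ($k_i = n-i-1$) networks.

```python
def replace_bit(u: int, pos: int, t: int) -> int:
    """Return label u with bit `pos` forced to t (all other bits unchanged)."""
    return (u & ~(1 << pos)) | (t << pos)


def route(src: int, tag: int, k: list[int]) -> list[int]:
    """Labels visited at stages 0..n when stage i examines tag bit k[i]."""
    path = [src]
    for pos in k:
        t = (tag >> pos) & 1          # the only tag bit this stage looks at
        path.append(replace_bit(path[-1], pos, t))
    return path


def check_property(n: int, k: list[int]) -> bool:
    """Check Property 3.1 and the DN property for every source/tag pair."""
    for src in range(1 << n):
        for tag in range(1 << n):
            path = route(src, tag, k)
            for i, u in enumerate(path):
                # the switch at stage i agrees with the tag in bits k[0..i-1]
                if any((u >> p) & 1 != (tag >> p) & 1 for p in k[:i]):
                    return False
            if path[-1] != tag:       # the destination address equals the tag
                return False
    return True


print(route(0b110, 0b011, [0, 1, 2]))   # [6, 7, 7, 3]
print(check_property(3, [0, 1, 2]))     # True  (k_i = i)
print(check_property(3, [2, 1, 0]))     # True  (k_i = n - i - 1)
```

The check passes for any permutation `k`, since each stage fixes one new bit of the label and never disturbs the bits fixed earlier.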

It is implicit in the construction of the UDN using the D-construct that for each source of the UDN, a binary tree
composed of the source and all switches reachable from the source is embedded in the UDN. To show that a binary tree
exists for each source, it needs to be shown that at stage i, $0 < i \le n$, all the $2^i$ switches reachable from the same source
are distinct switches. From the definition of the D-construct, the two successors (at stage 1) of the source have their
$k_0$-th bit equal to 0 and 1, respectively; so these two switches must be distinct switches. By Property 3.1, the four
switches at stage 2 reachable from the two successors must have the bits of their label in positions $k_0$ and $k_1$ equal to 00,
01, 10, and 11, respectively; so they are distinct switches. By simple inductive reasoning, at stage i, the labels of all
switches reachable from the successors at stage i-1 must differ in at least one bit in bits $k_0, k_1, \ldots,$ or $k_{i-1}$. Hence all
switches at stage i reachable from the source are distinct switches.
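The distinctness argument can be checked numerically with a small sketch. It assumes the simple bit-replacement D-construct (successors differ from the predecessor only in bit $k_i$); the function name `reachable_sizes` is hypothetical.

```python
def reachable_sizes(src: int, n: int, k: list[int]) -> list[int]:
    """Number of distinct switches reachable from `src` at stages 0..n."""
    frontier = {src}
    sizes = [1]
    for pos in k:
        # every switch has a 0-successor and a 1-successor differing in bit `pos`
        frontier = {(u & ~(1 << pos)) | (t << pos) for u in frontier for t in (0, 1)}
        sizes.append(len(frontier))
    return sizes


print(reachable_sizes(0b000, 3, [0, 1, 2]))  # [1, 2, 4, 8]
print(reachable_sizes(0b101, 3, [2, 1, 0]))  # [1, 2, 4, 8]
```

The size doubles at every stage, so the switches reachable from a source form a binary tree whose leaves are all N switches of the last stage.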

The following discussions demonstrate the construction of a UDN based on the D-construct. It is shown how to
construct interstage connection patterns between two adjacent stages; by repeating the same procedure for every stage, a
UDN results. Let $A^i$ be a subset of switches of stage i such that bits $k_0, k_1, \ldots, k_{i-1}$ of the label of every switch in the
same subset have the same value, and $B_0(u^i)$ and $B_1(u^i)$ be subsets of switches of stage i+1 defined as follows:

$B_0(u^i) = \{ v^{i+1} \mid v_{k_j} = u_{k_j},\ 0 \le j \le i-1,\ v_{k_i} = 0 \text{ and } u^i \in A^i \}$

$B_1(u^i) = \{ v^{i+1} \mid v_{k_j} = u_{k_j},\ 0 \le j \le i-1,\ v_{k_i} = 1 \text{ and } u^i \in A^i \}$

By definitions of $B_0(u^i)$ and $B_1(u^i)$, they are the subsets of possible successors for $u^i$ and can be connected to $u^i$ through
0-links and 1-links, respectively. Since we always refer to $\{A^i, B_0(u^i)\}$ or $\{A^i, B_1(u^i)\}$ for which $u^i$ is an element of $A^i$,
notation $A$, $B_0$ and $B_1$ are used henceforth to represent $A^i$, $B_0(u^i)$ and $B_1(u^i)$, respectively. For a given value of i,
$0 \le i \le n-1$, there are $2^i$ subsets of $A$, each subset consisting of $2^{n-i}$ switches; and $2^i$ subsets of $B_0$ and $2^i$ subsets of
$B_1$, each subset consisting of $2^{n-i-1}$ switches. A connection pattern between $A$ and $B_0$ (or between $A$ and $B_1$) is a set of
0-links (or 1-links), each of the 0-links (or 1-links) joining a switch of $A$ and a switch of $B_0$ (or a switch of $A$ and a switch
of $B_1$) such that the in-degree of every switch of $B_0$ (or $B_1$) is two and one link leaves from every switch of $A$. Since $A$ is
connected to both $B_0$ and $B_1$, the out-degree of every switch of $A$ is two in the resulting connection pattern. Note that
connection patterns of all $2^i$ pairs of $\{A, B_0\}$ and $\{A, B_1\}$ can be found independently. The following algorithm, CONNECTION-1, is used to
find connection patterns for $\{A, B_0\}$. Connection patterns for $\{A, B_1\}$ can be found similarly.

step 0: in-degree(v) = 0 for all $v \in B_0$.

step 1: If $A = \emptyset$, done.

step 2: Connect $u \in A$ to $v \in B_0$ and in-degree(v) = in-degree(v) + 1.

step 3: $A = A - \{u\}$ and if in-degree(v) = 2, $B_0 = B_0 - \{v\}$; go to step 1.
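The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `connection_1`, the use of a seeded random choice in step 2 (the text allows any switch still in $B_0$), and the requirement $|A| = 2|B_0|$ are assumptions made here for a self-contained example.

```python
import random


def connection_1(A, B0, rng=None):
    """Assign each switch of A one 0-link into B0 so every B0 switch gets in-degree 2."""
    if rng is None:
        rng = random.Random(0)            # assumption: any choice rule is allowed in step 2
    A, B0 = list(A), list(B0)             # work on copies; callers' lists stay intact
    if len(A) != 2 * len(B0):
        raise ValueError("expected |A| = 2 * |B0|")
    in_degree = {v: 0 for v in B0}        # step 0
    links = {}
    while A:                              # step 1: stop when A is empty
        u = A[0]
        v = rng.choice(B0)                # step 2: connect u to some v still in B0
        links[u] = v
        in_degree[v] += 1
        A.remove(u)                       # step 3: u is done
        if in_degree[v] == 2:
            B0.remove(v)                  # v is saturated and leaves B0
    return links


# Example: n = 3, stage i = 0 and k_0 = 0, so A is the whole stage and
# B0 holds the next-stage switches whose bit 0 is 0.
A = list(range(8))
B0 = [v for v in range(8) if v & 1 == 0]
links = connection_1(A, B0)
print(links)
```

Because $B_0$ offers exactly $2|B_0| = |A|$ input slots, the set never runs empty while a switch of $A$ is still waiting for its link.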
Step 0 is an initialization step. Since in the resulting connection pattern for $\{A, B_0\}$ every switch of $A$ is connected
to only one switch of $B_0$, a switch is removed from set $A$ as soon as its connection is determined in step 2. However,
every switch of $B_0$ is connected to two switches of $A$, so a switch is not removed from $B_0$ until two connections are made
between the switch and two switches of $A$. The algorithm repeats steps 1 to 3 iteratively for every switch of $A$ [...].

In step 2, for a given $u \in A$, $v$ can be any of the switches of $B_0$. Because each switch in $B_0$ can be selected twice (i.e.,
a switch of $B_0$ is connected to two switches of $A$), the size of $B_0$ can be regarded as $2 \cdot 2^{n-i-1} = 2^{n-i}$ and there
are $(2^{n-i})!$ possible connection patterns for $\{A, B_0\}$. In addition, because there are $2^i$ pairs of $\{A, B_0\}$ and $2^i$ pairs of
$\{A, B_1\}$, there are a total of $((2^{n-i})!)^{2^{i+1}}$ possible connection patterns between stages i and i+1. Then there are
$\prod_{i=0}^{n-1} ((2^{n-i})!)^{2^{i+1}}$ distinct topologies for the UDN's.

The ICube network and the modified augmented data manipulator (ADM) network fall into this class of UDN's.
The topology describing rules of the two networks can be readily transformed into D-construct expressions. In both
networks, any switch at stage i has $p_j = q_j = u_j$, $0 \le j \le n-1$, except for $j = k_i$. So the D-construct of both networks is
expressed as:

$$D^i(u^i, t) = \begin{cases} u_{n-1/i+1}\, t\, u_{i-1/0} & \text{for the ICube network} \\ u_{n-1/n-i}\, t\, u_{n-i-2/0} & \text{for the modified ADM network} \end{cases}$$

and the mapping of $k_i$ is $i$ and $n-i-1$, $0 \le i \le n-1$, for the ICube network and the modified ADM network,
respectively.

4. FAULT-TOLERANT IMPROVEMENT FOR DN'S


This section considers the simplest type of fault-tolerant tree, in which only one extra link is added to every node of
the binary tree, and the class of networks, called enhanced DN's (EDN's), that are constructed by coalescing such
fault-tolerant trees. As a result, an EDN has 3×3 switches and corresponds to the network that results from adding one
extra output link to every switch of a UDN derived in Section 3.


The extra link can be either a 0-link or a 1-link, so that a switch of the EDN has three output links: one 0-link and
two 1-links, or one 1-link and two 0-links. A 0-link (or 1-link) is said to be a conjugate 0-link (or 1-link) if there exists an
additional 0-link (or 1-link) for the switch, and the two are called a conjugate link of each other.


The enhanced D-construct used to construct the EDN's is as follows.


B'la'-l,...  T 


;>  'or 

V  .V'  =l'l 

r  for  t-j  ,  ^  ■-  10  or  1  1 


where $p_{k_j} = q_{k_j} = r_{k_j} = u_{k_j}$, $0 \le j \le i-1$; and $p_{k_i} = 0$, $q_{k_i} = 1$, and $r_{k_i} = 0$ if the conjugate links of the switch are 0-links
and $r_{k_i} = 1$ if the conjugate links are 1-links. The values for $p_{k_j}$, $q_{k_j}$ and $r_{k_j}$, $i+1 \le j \le n-1$, are determined according
to $u^i$ [...].

Routing in EDN's can also be controlled by the destination routing tag. If $t_{k_i} = 0$, messages can be routed via any of
the conjugate 0-links and reach the same destination; if $t_{k_i} = 1$, messages can be routed via any of the conjugate
1-links. The bit [...] determines which of the conjugate links is used for routing. If at some stage a conjugate 0-link (or
1-link) is blocked, rerouting can be done by sending messages via its conjugate link. A 0-link (or 1-link) being
nonconjugate means that the 0-link (or 1-link) is the only 0-link (or 1-link) of the switch. If a nonconjugate link on a path is
blocked, rerouting from the switch to which the blocked link is an output link is impossible and communication between
the source and the destination of the path is eliminated.


If a nonconjugate link is blocked, the only resort for rerouting is to backtrack along the routing path (that contains
the blocked link) to find a conjugate link at a lower-order stage, from which there may exist an alternate routing path
that can avoid the blocked link. Backtracking can also be implemented as a look-ahead scheme. However, backtracking
(or look-ahead) techniques require each switch to be able to detect inaccessibility of any output port (connected to a
switch at the next stage) and signal presence of the blockage back to the switches of previous stages. Even if
backtracking is available, an alternate routing path may not exist if no conjugate links exist at a lower-order stage of the
routing path.


The Gamma network falls into the category of EDN's. A switch $u^i$ of the Gamma network has three successors
at stage i+1: $(u+2^i)^{i+1}$, $u^{i+1}$ and $(u-2^i)^{i+1}$, $0 \le i \le n-1$. The D-construct expressions for the Gamma network are as
follows.

(i) For $u_i = 0$: $p_{n-1/0} = u_{n-1/i+1}\,0\,u_{i-1/0}$, $q_{n-1/0} = u_{n-1/i+1}\,1\,u_{i-1/0}$, $q'_{n-1/i+1} = u_{n-1/i+1} - 2^{i+1}$ and $q'_{i/0} = 1\,u_{i-1/0}$.

(ii) For $u_i = 1$: $q_{n-1/0} = u_{n-1/i+1}\,1\,u_{i-1/0}$, $p_{n-1/0} = u_{n-1/i+1}\,0\,u_{i-1/0}$, $p'_{n-1/i+1} = u_{n-1/i+1} + 2^{i+1}$ and $p'_{i/0} = 0\,u_{i-1/0}$.

From the D-construct expressions for the Gamma network, it is clear that links $(u^i, p^{i+1})$ and $(u^i, {p'}^{i+1})$ are the
conjugate 0-links of $u^i$ (for $u_i = 1$), and links $(u^i, q^{i+1})$ and $(u^i, {q'}^{i+1})$ are the conjugate 1-links of $u^i$ (for $u_i = 0$).
Comparing with the D-construct expressions for the ICube network, it is readily seen that the D-construct expressions for the
ICube network are a subset of those for the Gamma network. Links $(u^i, {q'}^{i+1})$ for $u_i = 0$ and $(u^i, {p'}^{i+1})$ for $u_i = 1$ of the
Gamma network are the extra links added to the ICube network. Past routing schemes for the Gamma network are
complicated distance tag routing schemes that require the computation of the distance between the source and the
destination in order to generate routing tags; by decomposing the Gamma network into fault-tolerant tree structures, it is
readily realized that destination tag routing can be used for the Gamma network.
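To make the destination tag view of the Gamma network concrete, the sketch below groups the three successors $u-2^i$, $u$ and $u+2^i$ (mod N) of a stage-i switch by bit i of their labels and then follows every link allowed by a tag. The function names `gamma_links` and `reachable_by_tag` are hypothetical, and bit 0 is assumed to be the least significant bit.

```python
def gamma_links(u: int, i: int, n: int) -> tuple[list[int], list[int]]:
    """Successors of switch u at stage i, split into (0-links, 1-links) by bit i."""
    N = 1 << n
    successors = {(u - (1 << i)) % N, u, (u + (1 << i)) % N}
    zero = sorted(v for v in successors if (v >> i) & 1 == 0)
    one = sorted(v for v in successors if (v >> i) & 1 == 1)
    return zero, one


def reachable_by_tag(src: int, tag: int, n: int) -> set[int]:
    """All last-stage switches reachable when stage i follows any link matching tag bit i."""
    frontier = {src}
    for i in range(n):
        bit = (tag >> i) & 1
        nxt = set()
        for u in frontier:
            zero, one = gamma_links(u, i, n)
            nxt.update(one if bit else zero)   # either conjugate link may be taken
        frontier = nxt
    return frontier


print(gamma_links(0b010, 0, 3))        # ([2], [1, 3])
print(reachable_by_tag(5, 0b011, 3))   # {3}
```

Whichever conjugate link is chosen at each stage, the message ends at the switch whose label equals the tag, because adding or subtracting $2^i$ never changes the bits below position i.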

The Gamma network is topologically equivalent to the IADM network; however, they use switches of different
types. Since this paper is concerned with only one-to-one routing, the results in this paper equally apply to both of them.
It was observed that the Gamma and IADM networks did not have one-fault-tolerance. By transforming the topology
of the Gamma network into D-construct expressions, it is easily seen that each switch of the Gamma network has only
conjugate 0-links or conjugate 1-links but never both, and thus the Gamma network does not have one-fault-tolerance.
Past techniques rely on distance tag schemes and topology comparison to show this property and derive only one
fault-tolerant topology for the Gamma network. It will be shown in Section 5 that a great number of fault-tolerant
topologies for the Gamma network can be systematically derived based on binary and fault-tolerant tree structures.
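The lack of one-fault-tolerance can be observed by counting paths. The sketch below, with hypothetical helper names, counts link-level paths between a source and a destination in a Gamma network whose stage-i successors are $u-2^i$, $u$ and $u+2^i$ (mod N); a pair with exactly one path is disconnected by the failure of any intermediate switch on that path.

```python
def gamma_successors(u: int, i: int, n: int) -> list[int]:
    """The three output links of switch u at stage i (two may reach the same switch)."""
    N = 1 << n
    return [(u - (1 << i)) % N, u, (u + (1 << i)) % N]


def count_paths(src: int, dst: int, n: int) -> int:
    """Number of distinct link sequences from src (stage 0) to dst (stage n)."""
    counts = {src: 1}
    for i in range(n):
        nxt: dict[int, int] = {}
        for u, c in counts.items():
            for v in gamma_successors(u, i, n):
                nxt[v] = nxt.get(v, 0) + c
        counts = nxt
    return counts.get(dst, 0)


n = 3
unique_pairs = [(s, d) for s in range(1 << n) for d in range(1 << n)
                if count_paths(s, d, n) == 1]
print(unique_pairs)
```

For n = 3 the pairs with a single path include every pair with source equal to destination: the message must take the nonconjugate (straight) link at every stage, so no rerouting is possible there.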

5. FAULT-TOLERANT D-CONSTRUCT AND FDN
The lack of adequate fault-tolerance for EDN's motivates the addition of a second extra link to every node of the
binary tree in order to form a fault-tolerant tree, which can be used to construct fault-tolerant networks. Also, due to the
network modularity requirement, the network so constructed corresponds to the one resulting from adding two extra
links, one 0-link and one 1-link, to every switch of a UDN derived in Section 3. Thus every switch of the network has one
pair of 0-links and one pair of 1-links. In order to maximize the fault-tolerance improvement, it is desirable to have each
of the 0-links (and 1-links) be connected to a different switch at the next stage; otherwise a faulty successor may block
both 0-links (or 1-links). Thus each switch is connected to four switches at the next stage and each stage of the network
consists of 4N links. Such a network is an FDN that can tolerate at least one switch failure (except the failures in input
and output switches, which are sources and destinations) and is capable of performing dynamic rerouting. This is true
because rerouting can always be done by sending messages through the conjugate link of a blocked link.

The fault-tolerant D-construct used to construct the FDN's is as follows.


y‘(t‘'dnag.'.0  = 


where $p_{k_j} = p'_{k_j} = q_{k_j} = q'_{k_j} = u_{k_j}$, $0 \le j \le i-1$, and $p_{k_i} = p'_{k_i} = 0$, $q_{k_i} = q'_{k_i} = 1$. In addition, to make each of the 0-links (or
1-links) be connected to a different switch at the next stage, the following conditions must be satisfied: $p_{k_j} \ne p'_{k_j}$ and
$q_{k_j} \ne q'_{k_j}$ for some j, $i+1 \le j \le n-1$. So $p_{n-1/0} \ne p'_{n-1/0}$, $q_{n-1/0} \ne q'_{n-1/0}$ and $u^i$ is connected to four distinct
switches at the next stage. The pair of switches $p^{i+1}$ and ${p'}^{i+1}$ and the pair of switches $q^{i+1}$ and ${q'}^{i+1}$ are called
conjugate switches. $(u^i, p^{i+1})$ and $(u^i, {p'}^{i+1})$ are conjugate 0-links and $(u^i, q^{i+1})$ and $(u^i, {q'}^{i+1})$ are conjugate 1-links for $u^i$. A
pair of conjugate switches are on two distinct routing paths to the same destination; thus they can reach the same subset
of destinations.

Algorithm CONNECTION-1 can be adapted to find conjugate links for and form FDN's from a UDN. Let U(u) be a
successor of a switch u in the UDN, F(u) be a successor of the switch u selected for the FDN, and the definitions for $A$
and $B_0$ be the same as those for algorithm CONNECTION-1. Also define $A'$ to be a set of switches that contains the
switches of $A$ whose connections with switches of $B_0$ were made and which were removed from $A$ for the construction of
an FDN. The following algorithm CONNECTION-2 finds a conjugate 0-link for every switch of $A$ of a UDN.

Algorithm  CONNECTION-2 

step 0: in-degree(v) = 0 for all $v \in B_0$.

step 1: If $A = \emptyset$, done.

step 2: If $|B_0| = 1$, let $B_0 = \{v\}$, and if $v = U(u)$, $u \in A$, connect $u' \in A'$ to $v \in B_0$ such that $v \ne U(u')$, disconnect
$u' \in A'$ and $F(u')$, and connect $u \in A$ to $F(u')$; in-degree(v) = in-degree(v) + 1; go to step 4.
step 3: Connect $u \in A$ to $v \in B_0$ such that $v \ne U(u)$ and in-degree(v) = in-degree(v) + 1.
step 4: $A = A - \{u\}$, $A' = A' + \{u\}$ and if in-degree(v) = 2, $B_0 = B_0 - \{v\}$; go to step 1.




Algorithm CONNECTION-2 has a more stringent constraint for selecting 0-links than that for algorithm CONNECTION-1. It is prohibited from choosing the same successor of a switch in the UDN as the conjugate 0-successor of the switch, so that each switch is connected to two distinct 0-successors in the FDN. Notice that when there is only one switch v left in B_0, there may be one or two switches left in A and v may also be the successor of the switches in A in the UDN; thus the only possible connection is to join the switches left in A and the same successors of theirs in the UDN (i.e. v). Step 2 adopts a remedial approach: it exchanges the connection with one that was chosen in a previous iteration and whose switch of B_0 is not a successor of the switches left in A in the UDN. The exchange is
always  possible  from  stage  0  to  n  —  >  because  (a)  if  there  are  two  switenes  !-ft  in  .1  .-.mi  they  .ir.'  ':;e  pr.'.irc-rrs,,., 
the  UDN,  then  any  of  the  (2"  '— 21  switches  in  .1  can  be  cnosen  for  the  exciiang'u  ;ii,a  t,'|  if  'luTe  is  oniv  oni,-  •  .■  ■ 

in  ,'l  (so  tliere  are  2^  '  —  I  switches  in  ,1)  anti  ;t  is  alst>  the  predecessor  of  ;■  Ti  U-e  tie-n  the"*'  .nust 

switch  in  .1  that  is  also  a  predecessor  of  v  in  tlie  UDN  ;for  ne  out-.lcgr.  e  of  swUcli  of  /(„  s  t  a,  n  i  ):i:ier'  , 

pattern  of  the  UDN),  and  any  of  the  swiiciics  m  .1  r;,ri  i,e  •■■Av^on  .Tr  '.lie  exri.ange  ne 

(j  =»  n— I),  l.-l  I  =  2  and  |fl,j|  =  1,  so  both  conjugate  0-links  laiid  l-liiiks'  of  a  switch  n  .V  ,ire  ■onnerie<i  to  t,.';  n  i"  ,  , 
put  switch  (destination). 

Assume  that  at  the  last  iteration,  the  last  switch  v  left  in  /(g  is  always  the  “urce--sor  of  :.rm'a  i  \  n  • 

UDN,  then  at  each  iteration  a  switch  in  -A  ran  choose  only  one  cu'  the  swit.thes  ;u  /T)  -  )i'  I  as  .:.i  sucersou  .i:  o  ■ 

that  at  each  iteration  of  the  algorithm  for  ''viiici).  for  each  switch  n  in  ,V,  /ig  always  i-onia uis  i  (u  tn-"  v..: 
can  cliooso  only  one  of  the  switches  in  Bn  -  i”)  -  {('('til  as  its  succe-ior.  I  sum  i  lie  arguinenis  similar  ■  i  'no',’  laru  : 

'I  - 

computing  the  total  number  of  possible  !  D.N  topologies,  it  can  oe  siiown  'lial  tmTe  "x.si  at  .east  (  |  ‘/j  < 

I  ) 

ble  FDN  topologies  for  a  given  UD.N. 

Examples of FDN's are the F network and the Kappa network. The fault-tolerant D-constructs of these two networks are as follows.

fi)  For  the  F  network,  m)  =  a.,  _ , 4 -i  o’  a  “  '■  '-i  j  1  ■  ‘ ' 

Vn-i/Q  =  ’N-i/ia|l'h-i,o- 

lii)  For  the  Kappa  network,  =  _•  ,j,  'j^  ;  e  =  i  ,  .  U  '•  >  ^ 

i’a-l/iM  =  '‘n -1  '1  ■  I +2'  .rt,  0  ~  **''1-1/1'  -"a-’  i-l“-  ■*""  C  '  'a-'  r 

;N-i,..i  =  -N-i,..)  -il'''.  ;h,'0  -""'-I,)-  =  "s  To  ; 


The  i\anpa  network  was  'ierweu  as  a  resuil  of  'lie  ■  ":;.;>ar'='in  f.w'Ai  a 

tnouilicd  ll.aseiine'*  networks,  ily  embedding  '.he  modiiieu  ilaneiim-  ■.v.irr;  .n  'c 

redundancy  in  the  tj.amma  network  s  oi'served  and  tiie  addition  oi  an  "xT.a  linK  'o  evi  .w  0. 
work  ’.va,3  proposed  to  achieve  .symmetric  rcti'iiidaney  .and  ono-l'aait-t oierance.  .'■>■  ■■I'li'oSii,.',  ■  ic 

faiill-tolcrant  tree  structures  and  transforming  its  lopoiogy  describing  rule  into  it-const.-.ici  ■  \;  ressu 
fanit-lolerancc  In  tiie  Uamrria  network  becomes  evident  li.c.  some  switci-cs  i.ave  •■.■I'lCon;  icai.. 

I  I  1(2'*  f.aiiit-toleranl  lopologies  (by  making  ail  '.witclies  I'.av"  n.'ug.ate  0  lines  and  1  oiir;:-!  ■ 


the Kappa network is merely one.

There  are  two  po^ssibie  impiofncfitations  for  routi/ig  nfiiocie.s  Tor  i-|)N*.'i  in-.J  .ue  wi  ilic  ■■.ovi, 

graphs. The first scheme requires no computation of rerouting tags and can dynamically bypass faults in the network. The second scheme requires the computation of rerouting tags by the source of messages but has the advantage of being capable of handling more complicated multiple faults in the network.

In the first scheme the destination address can be used directly as the routing tag, i.e., t_i = d_i. This scheme assumes that each switch u^i uses (u^i, p^(i+1)) and (u^i, q^(i+1)) as the default 0-link and 1-link, respectively, for routing messages. If t_i = 0, link (u^i, p^(i+1)) is used for routing and if t_i = 1, link (u^i, q^(i+1)) is used for routing. The conjugate links of these two default links are regarded as spare links. If a fault in the default link is detected or the link is blocked due to failure in the switch to which the link is an input link, the switch automatically reroutes messages via the conjugate link of the blocked link. Thus rerouting is transparent to the source of the messages and no rerouting tags need to be computed. Each switch requires negligible extra hardware for the detection of blocked links.
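
The per-switch behaviour of the first scheme can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: the network topology is abstracted away, the names `select_link` and `route` are hypothetical, and the fault set is expressed as `(stage, tag_bit, kind)` triples describing the links seen by whichever switch holds the message at that stage.

```python
from typing import Iterable, List, Optional, Set, Tuple

Link = Tuple[int, int, str]  # (stage, tag bit, "default" or "conjugate")


def select_link(tag_bit: int, default_blocked: bool,
                conjugate_blocked: bool) -> Optional[str]:
    """Choose the outgoing link of one switch for one routing-tag bit.

    The default link is preferred; the conjugate (spare) link is used only
    when the default link is blocked. None means the message cannot proceed.
    """
    if tag_bit not in (0, 1):
        raise ValueError("tag bit must be 0 or 1")
    if not default_blocked:
        return "default"
    if not conjugate_blocked:
        return "conjugate"
    return None


def route(dest_bits: Iterable[int], blocked: Set[Link]) -> Optional[List[Link]]:
    """Route a message using the destination address as the routing tag.

    No rerouting tag is computed by the source: each switch decides locally.
    """
    path: List[Link] = []
    for stage, bit in enumerate(dest_bits):
        kind = select_link(bit,
                           (stage, bit, "default") in blocked,
                           (stage, bit, "conjugate") in blocked)
        if kind is None:
            return None  # both links of the conjugate pair are blocked
        path.append((stage, bit, kind))
    return path


if __name__ == "__main__":
    # One blocked default 0-link at stage 1 is bypassed transparently.
    print(route([1, 0, 1], {(1, 0, "default")}))
```

Because the decision is local to each switch, the source never needs to know where the fault is, which is the property the text attributes to this scheme.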

The second scheme requires the source of messages to compute a routing tag. The source is assumed to know the locations of faulty links and switches in the network so that they can be compared against the links and switches on the routing path to decide whether faults occur on the routing path and rerouting is necessary. A routing tag consists of n digits, the k-th digit of the routing tag being represented by two bits, called the state bit and the destination bit, respectively. The destination bit t_k specifies the use of a 0-link or 1-link for routing and t_k = d_k, 0 <= k <= n-1, where the d_k are the digits of the destination address. If t_k = 0, a 0-link is used and if t_k = 1, a 1-link is used. The state bit is used to specify which one of


Although other fault-tolerant topologies for the Gamma network are also possible based on the comparison of topologies, the number of them is less than that derived by using the approach proposed in this paper. This is because the block structure is only a subset of the binary tree structures, as explained in Section 2.
ABSTRACT

Because processor arrays have connections only between neighboring processors, fault-tolerance schemes for these arrays require additional interconnect, switching and control hardware in order to allow for reconfiguration when faults occur. In general, the larger the reconfiguration capability, the greater the probability that a processor array can survive a given distribution of faults. In other words, the coverage of the reconfiguration procedure increases directly with the amount of extra hardware required to support it. However, this is true only if the added hardware is assumed fault-free. For this reason, and depending on the size of the processor array and the technology used, distinct reconfiguration schemes may be suited to different arrays. Also, in general, previously proposed schemes still result in unacceptably low reliabilities in very large processor arrays.

This paper proposes a class of reconfiguration schemes which have a hierarchical nature. According to this approach, a processor array is logically partitioned into smaller subarrays and, once faults occur, reconfiguration takes place within each of the subarrays (where faults are present) if possible and, otherwise, the full subarray is replaced by a spare subarray. Arrays of this type are referred to as bi-level fault-tolerant processor arrays and, by allowing several levels of reconfiguration, multi-level arrays can be defined similarly. While several levels of reconfiguration are possible, the case of two levels is emphasized in this paper. Also, the reconfiguration schemes used in each level are not necessarily identical. This class of hierarchical reconfiguration schemes provides much higher reliability than previously proposed ones, particularly in the case of very large arrays.

To design a hierarchical reconfiguration scheme for a given processor array it is necessary to choose the size of the subarrays for every level in the hierarchy as well as the reconfiguration scheme at that level. A design methodology is provided which mathematically solves these problems, i.e., it enables the choice of the size of the subarray and the reconfiguration scheme to be used at each level so as to obtain a processor array with optimal reliability.

In a fault-tolerant processor array (FTPA) redundancy may or may not be provided for every type of component of the array. As a typical possible case, an FTPA may have spare processors but no redundancy provided for the links, switches, and control logic that implement the fault-tolerance scheme. The paper discusses the impact of non-redundant hardware on the reliability of an FTPA and shows that FTPA's with superior reliability result if the fault-tolerance scheme used has a hierarchical structure. These FTPA's are called multi-level FTPA's, and a procedure for the design of optimally reliable FTPA's of this kind is also described in this paper.

The relevance of processor arrays (PA's) stems not only from their ability to meet the computational demands of many real-time applications but also from their suitability for efficient implementations using Very Large Scale Integration/Wafer Scale Integration (VLSI/WSI) technology. Large PA's can be implemented by replicating many times a small basic hardware module which might include not only the processors but also some hardware for switching, control and interconnections. The design of the basic module can be effectively optimized and, therefore, it can be used to implement area- and time-efficient arrays of arbitrary size at a lower cost. In this paper, it is assumed that a basic module corresponds to a single processor. However, since not all interconnect, switching and control logic can be implemented in modules, the results of this paper also apply when other types of modules are used. Assume the reliability of each module is r; the reliability of a PA with N modules is then r^N, which can be rather low when N is very big. This is equivalent to the statement that, regardless of the reliability per unit of area of the PA, the reliability of a PA with large area is very small, unless fault-tolerance is provided. From this perspective, it becomes clear that the first statement is true independently of the type of module used, since what really matters is the overall amount of area taken by the array and, assuming no fault-tolerance, the reliability of a large array with small modules is the same as that of an array with the same area but larger modules. Thus, reliability prevents the exploitation of two of the advantages of processor arrays, i.e., modularity and extensibility.
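
The r^N argument is easy to check numerically. The short Python sketch below is illustrative only; the function names and the sample value 0.999 are assumptions chosen for the example, and module failures are assumed independent.

```python
def array_reliability(module_reliability: float, num_modules: int) -> float:
    """Reliability of a PA without fault tolerance: every module must work."""
    return module_reliability ** num_modules


def module_reliability_from_area(unit_area_reliability: float, area: float) -> float:
    """Reliability of one module occupying `area` units of area."""
    return unit_area_reliability ** area


if __name__ == "__main__":
    # Reliability collapses as the number of modules grows.
    for n in (16, 256, 4096):
        print(n, array_reliability(0.999, n))

    # Same total area, different module sizes: the array reliability matches.
    small = array_reliability(module_reliability_from_area(0.999, 1), 1024)
    large = array_reliability(module_reliability_from_area(0.999, 4), 256)
    print(small, large)
```

The second comparison illustrates the remark that only the total area matters when no fault tolerance is provided.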

Extensive work has been done towards devising FTPA designs; the cited references and the references therein give a non-exhaustive list. Reconfiguration approaches that resemble those described in some of these references are used to illustrate the basic ideas and results of this paper. These methods include Column Redundancy (CR, and others), Diogenes (DI), Complex Fault Stealing (CFS) and Triplicated Modular Redundancy (TMR). Succinct descriptions of the approaches that these methods use are given in Section 3 of this paper.

Typically, previously proposed FTPA designs assume that the non-redundant hardware, i.e., the links, switches and control logic, never fails during array operation. It is assumed that these components are small and simple and can therefore be designed carefully and conservatively so as to minimize the probability of operational failure. Historically, this approach finds its origins in the nature of earlier electronic systems where, due to a less reliable technology and the large size of logic devices, bulky active logic was much more likely to fail than smaller and easier-to-implement passive connections and simple switches. Another justification for the validity of the fault-free interconnect assumption is the belief that the reliability of interconnect can always be improved by replicating inexpensive wires, whereas more elaborate schemes with limited redundancy are required for the "expensive" logic. With some current technologies the same type of
assumptions  may  still  be  valid  in  some  cases,  e.g.,  fabrication  faults  in  the  wafer-scale  integrated  systems  described  in 
occur  with  probability  of  91%  in  wires  versus  Un%  in  processing  elements  (PR's)  and  reconfiguration  mechanisms  are 
80.09% reliable. However, it is well known that these assumptions must be re-evaluated in light of available Very Large Scale Integration/Wafer Scale Integration (VLSI/WSI) technology, particularly when operational faults (instead of manufacturing defects) are to be considered. In fact, in VLSI/WSI systems, interconnect can take up to 50% of the total area and, in simple terms, it is made by using processes and materials similar to those used for active logic. Thus, it is more appropriate to characterize the lifetime reliability of different types of components of a system in terms of the area that they require and the reliability per unit of area of the medium used for their implementation.

The understanding of the impact of the operational reliability of non-redundant hardware on the overall operational reliability of an FTPA is one of the goals of this paper. Similar studies have been done for other types of systems but, to our knowledge, they have not been reported for FTPA's. The idea of considering multi-level FTPA's is a consequence of the understanding just mentioned, in the sense that the hierarchical structure of these FTPA's reduces the amount of non-redundant hardware in the overall array. It is appropriate to mention here that related or similar concepts have been independently proposed before; however, the motivations for putting forward those ideas, while valid, seem to be different from the reasons presented in this paper.

Section 2 of this paper presents the basic ideas and models that underlie the approach and studies reported in this paper. In Section 3, the reliability characteristics of single-level FTPA's are discussed. Four different types of FTPA's are described and area and reliability estimates for these FTPA's are presented and discussed. The approach used to compute these estimates is also explained. Section 4 introduces the concept of multi-level FTPA's and provides reasons why their reliability is potentially higher than that of single-level FTPA's. The issue of how to design optimal bi-level FTPA's is the topic of Section 5. It is shown that it is not feasible to attempt to derive optimal design parameters directly from the analytical expressions that describe the reliability of a bi-level FTPA. To solve this problem, accurate functional approximations of those expressions are proposed and used for optimization purposes. A case study is briefly described that illustrates the very high reliability improvement achievable by bi-level FTPA's in contrast with the very poor reliability possible with a single-level FTPA. Section 6 is dedicated to conclusions.

2. BASIC IDEAS AND MODELS

A very general model is used to represent characteristics of FTPA's that are relevant to the purpose of this paper. A processor array is described by a 4-tuple (P, L, S, C) where P corresponds to the part of the array used for processing, L denotes all the link components in the array, S is the set of switching components, and C denotes the set of control logic components responsible for the control of processing, linking and switching elements. Clearly, in order for an array to be fully operational, each of these component parts of an array must operate correctly. In general, a component can belong to only one of the sets P, L, S and C. However, in order to introduce fault-tolerance in the component P of a PA = (P, L, S, C), a new array (P', L', S', C') is obtained where not only P' but also L', S' and C' result from adding elements to P, L, S and C, respectively. Consequently, the reliability of L', S' and C' may also differ from that of L, S and C. The difference in reliability depends on both the added elements and the logical and physical organization of the components of the array.

Without loss of generality, assume that an FTPA = (P, L, S, C) contains (n x n) + k PE's, where k corresponds to the number of spares. Let A_p(n,k), A_l(n,k), A_s(n,k) and A_c(n,k) denote the areas used to implement P, L, S and C, respectively, and let A(n,k) = A_p(n,k) + A_l(n,k) + A_s(n,k) + A_c(n,k) denote the total area of the array. Let r_p, r_l, r_s and r_c denote the reliability of the unit of area of an element in P, L, S and C, respectively. Similarly, let R_p(n,k), R_l(n,k), R_s(n,k) and R_c(n,k) refer to the reliability of P, L, S and C. Then the reliability of a processor array is given by

$R(n,k) = R_p(n,k) \cdot R_l(n,k) \cdot R_s(n,k) \cdot R_c(n,k)$    (2.1)

and the overall reliability of the array is at most as good as the lowest of the reliabilities of P, L, S and C.
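
As a small numerical illustration of (2.1), the Python sketch below multiplies the four component reliabilities and checks the bound by the weakest component. It assumes, for illustration only, that a component without redundancy and with area A has reliability r^A, where r is its reliability per unit of area; the function names and sample values are hypothetical.

```python
from math import prod
from typing import Dict


def component_reliability(unit_area_reliability: float, area: float) -> float:
    """Reliability of a non-redundant component of the given area (assumed r**A)."""
    return unit_area_reliability ** area


def array_reliability(component_reliabilities: Dict[str, float]) -> float:
    """Equation (2.1): product of the reliabilities of P, L, S and C."""
    return prod(component_reliabilities.values())


if __name__ == "__main__":
    reliabilities = {
        "P": component_reliability(0.999, 64.0),   # processors
        "L": component_reliability(0.9995, 20.0),  # links
        "S": component_reliability(0.9995, 10.0),  # switches
        "C": component_reliability(0.9995, 5.0),   # control logic
    }
    total = array_reliability(reliabilities)
    print(total, min(reliabilities.values()))
```

Since every factor lies in [0, 1], the product can never exceed the smallest factor, which is the sense in which the weakest of P, L, S and C bounds the array.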

In general, proposed reconfiguration schemes adopt as the basic replacement unit a module and assume that the hardware outside the basic module (i.e., the non-redundant hardware) is fault-intolerant. This suggests that the model just proposed can be generalized further. In essence, the area of an FTPA consists of some fault-tolerant area A_ft(n,k) and the remaining area A_fi(n,k) = A(n,k) - A_ft(n,k) which is fault-intolerant. Denoting the reliability of these areas by R_ft(n,k) and R_fi(n,k), respectively, the reliability of the FTPA is

$R(n,k) = R_{ft}(n,k) \cdot R_{fi}(n,k)$    (2.2)

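In other words, given an FTPA with (n x n) + k modules (where k is the number of spares and the unit of area is defined as the area of a module) and letting c_i denote the probability that the FTPA can recover from the failure of i modules, i <= k, equation (2.2) becomes

$R(n,k) = R_{fi}(n,k) \cdot \sum_{i=0}^{k} c_i \binom{n^2+k}{i}\, r_{ft}^{\,n^2+k-i}\,(1-r_{ft})^{i}$    (2.3)

where r_ft denotes the reliability of one module, i.e., of one unit of fault-tolerant area.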


Expressions (2.2) and (2.3) are valid for any type of FTPA. Without loss of generality and in order to facilitate the presentation and discussion of the ideas and results, several assumptions apply to the remainder of this paper and are described next. Given any FTPA (P, L, S, C), it is assumed that: (1) P is fault-tolerant and L, S and C are fault-intolerant, i.e., A_fi(n,k) = A_l(n,k) + A_s(n,k) + A_c(n,k), R_fi(n,k) = R_l(n,k) · R_s(n,k) · R_c(n,k), A_ft(n,k) = A_p(n,k), R_ft(n,k) = R_p(n,k); (2) with the exception of the reconfiguration scheme used, all other procedures required for fault tolerance are perfect (e.g., fault detection and location) and, therefore, c_i in equation (2.3) corresponds to the probability of successful reconfiguration given i faulty PE's.
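
Equation (2.3) can be evaluated numerically. The following is a minimal Python sketch, not the authors' code: it assumes the binomial form of (2.3) with independent module failures, takes R_fi(n,k) to be r_fi raised to the fault-intolerant area, and uses illustrative names (`coverage` stands in for c_i; `r_ft`, `r_fi` and `area_fi` are the per-unit-area reliabilities and the fault-intolerant area).

```python
from math import comb
from typing import Callable


def fault_tolerant_reliability(n: int, k: int, r_ft: float,
                               coverage: Callable[[int], float]) -> float:
    """Summation term of (2.3): the array survives i <= k module failures
    with probability coverage(i)."""
    total = n * n + k
    return sum(
        coverage(i) * comb(total, i) * r_ft ** (total - i) * (1.0 - r_ft) ** i
        for i in range(k + 1)
    )


def ftpa_reliability(n: int, k: int, r_ft: float, r_fi: float, area_fi: float,
                     coverage: Callable[[int], float] = lambda i: 1.0) -> float:
    """R(n,k) = R_fi(n,k) * R_ft(n,k), with R_fi assumed to be r_fi**area_fi."""
    return (r_fi ** area_fi) * fault_tolerant_reliability(n, k, r_ft, coverage)


if __name__ == "__main__":
    # 8x8 array with 4 spares, perfect coverage, 20 units of fault-intolerant area.
    print(ftpa_reliability(n=8, k=4, r_ft=0.99, r_fi=0.9995, area_fi=20.0))
```

Passing a different `coverage` function (for instance one that drops below 1 as i approaches k) models reconfiguration schemes that cannot always use every spare.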

From equation (2.3) it is possible to infer important characteristics that apply to all FTPA's. In particular, it is of interest to answer the questions of how R(n,k) varies when more spare PE's are added to the FTPA so that (a) the ratio between the number of processors n^2 and the number of spares k stays constant (in other words, the redundancy ratio (n^2+k)/k stays constant), or (b) the number of processors n^2 remains constant. To answer these questions, the variations of R_ft(n,k) and R_fi(n,k) with increases in k are analyzed. Because R_fi(n,k) = r_fi^{A_fi(n,k)}, it is clear that this term decreases in both cases. With respect to R_ft(n,k), i.e., the summation term in (2.3), it is useful to consider the special case when c_i equals a constant c_0 for all i <= k. In this case, the De Moivre-Laplace theorem states that, for sufficiently large n^2+k, R_ft(n,k) can be approximated (up to the constant factor c_0) by the normal (Gaussian) distribution function Φ(x(n,k)), where

$x(n,k) = \frac{k - (n^2+k)(1-r_{ft})}{\sqrt{(n^2+k)\, r_{ft}\,(1-r_{ft})}}$

For large k (which implies that n^2+k is very large), Φ(x(n,k)) can be approximated as
column and each row. Two extra buses and corresponding switches are added to allow for the "alignment" of input and output ports with operational I/O processors. All switches are individually controlled according to a distributed algorithm. Switch control signals are generated by neighboring logic whose inputs originate from similar control logic associated with the four next-neighbor processors. The reconfiguration algorithm is successful as long as the number of faults is not very close to the number of spares; otherwise, the success rate decays as the number of faults increases.

In an FTPA using the TMR approach, each processor is triplicated and its outputs are voted. This corresponds to a static form of redundancy and there is no reconfiguration. Therefore, the only extra hardware added includes the voting logic and the minimal wiring required to link it to three processors. This "naive" TMR approach is not to be confused with the more elaborate TMR approaches proposed elsewhere.
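
For reference, the reliability of one triplicated-and-voted processor follows the standard 2-out-of-3 formula. The Python sketch below is an illustration and is not taken from the paper: it assumes independent processor failures, a majority voter that produces a correct output whenever at least two of the three processors are correct, and a hypothetical `voter_reliability` parameter for the voting logic.

```python
def tmr_reliability(r: float, voter_reliability: float = 1.0) -> float:
    """Reliability of one TMR group: at least 2 of 3 processors work."""
    return voter_reliability * (3.0 * r ** 2 - 2.0 * r ** 3)


def tmr_array_reliability(n: int, r: float, voter_reliability: float = 1.0) -> float:
    """Reliability of an n x n logical array in which every processor is a TMR group."""
    return tmr_reliability(r, voter_reliability) ** (n * n)


if __name__ == "__main__":
    print(tmr_reliability(0.9))              # single TMR group
    print(tmr_array_reliability(8, 0.99))    # 8x8 logical array
    print(0.99 ** 64)                        # same array without redundancy
```

The voter sits in series with the triplicated processors, so an unreliable voter caps the benefit; this is one concrete instance of non-redundant hardware limiting the reliability of an FTPA.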

A precise estimate of the areas required by FTPA's is possible only if their designs are carried to completion. However, the conclusions of this section do not depend on extremely accurate area estimates because relative rather than absolute information about area is sought. In particular, it is the goal of this section to illustrate (a) how the reliability of each FTPA design varies with the array size and the number of spares used and (b) how the reliabilities of different FTPA designs compare with each other for different array sizes. Thus, it suffices to guarantee that the rules and assumptions used to estimate area are consistent across different FTPA designs and capture the variation of reliability with array size and number of spares. These rules and assumptions are described next.

The unit of area is defined as the area of a PE, i.e., A_p(n,k) = n^2 + k = total number of PE's in the FTPA. The total area taken by links is

$A_l(n,k) = a_l \sum_{i=1}^{n_l} l_i$

where n_l is the number of links, l_i is the length of the ith link and a_l is the area of a link with unit length (a_l can also be thought of as the wire width). The length of the links depends on the geometry of the FTPA layout. Assuming that PE's correspond to points in the Cartesian plane, the length of a wire is assumed to be the Manhattan distance between the source and destination of the wire. The total area taken by switches is

$A_s = a_s \sum_{i=1}^{n_s} in_i \times out_i$

where n_s is the number of switches, in_i and out_i correspond to the fan-in and fan-out of the ith switch, respectively, and a_s is the area of a switch for which in_i = out_i = 1.
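
The link and switch area rules translate directly into code. The Python sketch below is a minimal illustration; the data layout (links as pairs of grid points, switches as fan-in/fan-out pairs) and the function names are assumptions made for the example.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    """Wire length between two PE positions on the Cartesian grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def link_area(links: Iterable[Tuple[Point, Point]], a_l: float) -> float:
    """A_l = a_l * sum of link lengths (a_l is the area per unit length)."""
    return a_l * sum(manhattan(src, dst) for src, dst in links)


def switch_area(switches: Iterable[Tuple[int, int]], a_s: float) -> float:
    """A_s = a_s * sum of fan-in * fan-out (a_s is the area of a 1x1 switch)."""
    return a_s * sum(fan_in * fan_out for fan_in, fan_out in switches)


if __name__ == "__main__":
    links = [((0, 0), (0, 1)), ((0, 0), (2, 3))]   # lengths 1 and 5
    switches = [(1, 1), (2, 3)]                     # sizes 1 and 6
    print(link_area(links, a_l=0.1), switch_area(switches, a_s=0.5))
```

Keeping `a_l` and `a_s` as explicit parameters mirrors the way the study varies them to test how sensitive the reliability estimates are to the area assumptions.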

The area taken by control circuitry consists of the area taken by the logic that generates the control signals and the area taken by the links that propagate those signals. In the case of locally generated control signals the area taken by links is assumed to be zero. Also, it is assumed that the area of the logic necessary to control a switch is proportional to the size of the switch. Then, A_c = A_clocal + A_cglobal, where

$A_{clocal} = a_c \sum_{i=1}^{n_{cl}} in_i \times out_i$

n_cl is the number of the switches with local control and a_c is the area of the smallest logic element (e.g., a 1x1 switch, in which case a_c = a_s), and

$A_{cglobal} = \sum_{i=1}^{n_{cg}} \left( a_c \, in_i \times out_i + a_l \times l_{ci} \right)$

where n_cg is the number of global signals, in_i and out_i correspond to the fan-in and fan-out of the switches controlled by the ith global control signal and l_ci is the length of the wire used to propagate that signal.

Several area estimates were derived for FTPA's using CR, DI, CFS and TMR approaches, assuming different values for a_l, a_s and a_c and using different assumptions about the geometry of the layout. These area estimates and equation (2.3) were used to obtain reliability estimates assuming different values for r_ft = r_p, r_fi, r_s, r_c and r_l. These estimates were then plotted as functions of the number of spares k and as functions of the number n^2 of processors in the logical array. Figures 1 and 2 show some of these curves and additional ones are reported elsewhere. Two general conclusions resulted from these studies:

(1) as predicted by the analysis done in Section 2, the overall reliability of an FTPA increases as the number of spares increases up to a certain number and then starts to decrease due to the reliability of non-redundant hardware; this is clearly illustrated in Figure 1 where the reliabilities of the fault-tolerant and fault-intolerant areas (R_ft and R_fi, respectively) for an FTPA using the CR approach are plotted as functions of the number of spare columns; in this particular example, the FTPA reliability (i.e., the product of R_ft and R_fi) shows no improvement as the number of spare columns is increased beyond 28 because R_fi decreases faster than R_ft increases after this value;

(2) the choice of an FTPA design with the best reliability depends on the size of the array, on the number of spares and on other technological parameters (r_ft, r_fi, a_c, a_s, etc.); in other words, FTPA's of different size may require different design approaches in order to achieve maximal reliability with a given technology, number of spares and layout geometry; this is illustrated in Figure 2 where CFS and DI approaches may or may not provide better reliability depending on the array size.
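
The rise-then-fall behaviour in conclusion (1) can be reproduced qualitatively with a small sweep over the number of spares. The Python sketch below is illustrative only and does not reproduce the paper's figures: it assumes perfect coverage (c_i = 1), independent PE failures, and a fault-intolerant area that grows linearly with the number of PE's (`area_per_pe` is a hypothetical parameter); all names and sample values are assumptions.

```python
from math import comb
from typing import List, Tuple


def r_ft(n: int, k: int, r: float) -> float:
    """Reliability of the fault-tolerant area: at most k of n*n+k PE's fail."""
    total = n * n + k
    return sum(comb(total, i) * r ** (total - i) * (1.0 - r) ** i
               for i in range(k + 1))


def r_fi(n: int, k: int, r_unit: float, area_per_pe: float) -> float:
    """Reliability of the fault-intolerant area, assumed proportional to n*n+k."""
    return r_unit ** (area_per_pe * (n * n + k))


def best_spares(n: int, r: float, r_unit: float, area_per_pe: float,
                k_max: int) -> Tuple[int, List[float]]:
    """Return the spare count maximizing R = R_ft * R_fi, plus the whole curve."""
    scores = [r_ft(n, k, r) * r_fi(n, k, r_unit, area_per_pe)
              for k in range(k_max + 1)]
    best_k = max(range(k_max + 1), key=scores.__getitem__)
    return best_k, scores


if __name__ == "__main__":
    k_opt, curve = best_spares(n=8, r=0.95, r_unit=0.999, area_per_pe=0.5, k_max=30)
    print("best number of spares:", k_opt)
    print("R at k=0, k_opt, k_max:", curve[0], curve[k_opt], curve[-1])
```

Once the summation term saturates near 1, each additional spare only adds fault-intolerant area, so the product starts to fall; the location of the peak depends entirely on the assumed area model.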

4. MULTI-LEVEL FTPA'S


The basic idea behind the design of a multi-level FTPA is best explained if the particular case of a bi-level FTPA is considered first. A bi-level FTPA consists of a fault-tolerant array of FTPA's. In other words, the full array is partitioned into subarrays and can be thought of as an array of subarrays. Both the subarrays and the array of subarrays use some fault-tolerance scheme. The subarrays are hereon denoted as 1st-level FTPA's and the array of subarrays is referred to as the 2nd-level FTPA. A 2nd-level FTPA can be thought of as an FTPA where the basic modules are themselves FTPA's and, physically, it is the same as the bi-level FTPA. The extension to multi-level FTPA's is easily made by realizing that an n-level FTPA consists of an FTPA whose basic modules are (n-1)-level FTPA's, which are FTPA's composed of (n-2)-level FTPA's, etc. For convenience of presentation, bi-level arrays are assumed hereon and, unless stated otherwise, the basic ideas and results apply to multi-level arrays as well.


When faults occur in a bi-level array, reconfiguration is first attempted at the 1st-level FTPA's and, when reconfiguration fails at the 1st level, reconfiguration at the 2nd-level FTPA is attempted. Intuitively, bi-level FTPA's can be expected to have better reliability than single-level FTPA's for two reasons: (1) the area of non-redundant (fault-intolerant) hardware in bi-level FTPA's is smaller than that in single-level FTPA's and (2) the size of arrays and the reconfiguration approach used at each level can be chosen so that optimal reliability results, thus avoiding the inevitable reliability degradation that occurs when the size of single-level arrays grows too large.

Reason (1) can be restated in different terms in order to provide additional insight. One can think of the structure
of "extra hardware" (non-redundant hardware, e.g., switches and control logic) in single-level FTPA's as a series success
diagram where individual extra hardware elements correspond to the modules of the diagram (not to be confused with
the FTPA modules). In this type of serial structure, a failure in a single module implies failure of the complete system.
On the other hand, the structure of the extra hardware in a multi-level FTPA is such that the corresponding success
diagram is a series-parallel diagram with as many stages as there are levels in the multi-level FTPA. Clearly, the failure
of any module now results in the failure of the associated reliability paths but not necessarily a total failure. In
relation to reason (2) above, the advantage of being able to choose the size and fault-tolerance method used at each level is
that it becomes possible to use the best fault-tolerance method for a given size of array or, given the method, find the
array size which is the best, or both. In other words, the dependency on array size, discussed in section 3 as a
disadvantage of single-level FTPA's, can be advantageously exploited in each level of multi-level FTPA's.

Having realized the potential benefits of multi-level FTPA's, from an engineering point of view, it is essential to have
a systematic and formal approach to the design of these systems so as to optimize the reliability of the overall FTPA. This
approach is described in the remainder of this paper.

5. OPTIMAL BI-LEVEL FTPA'S

For the purposes of this section, it is convenient to refer to the reliability of an FTPA as given by (2.3) as an explicit
function not only of $n$ and $k$ but also of $r_p$. Let the size of the 1st-level and 2nd-level FTPA's be $(n_1 \times n_1) + k_1$ and
$(n_2 \times n_2) + k_2$, respectively. Here, $k_1$ and $k_2$ correspond to the number of spare processors in the 1st-level FTPA and the
number of spare 1st-level FTPA's in the 2nd-level FTPA, and can be expressed as functions of $n_1$ and $n_2$, respectively.
Also, the number of processors in the bi-level array is $(n \times n) + k$ where $n = n_1 n_2$ and $k = k_1 n_2^2 + k_2 (n_1^2 + k_1)$. The
reliability of the bi-level FTPA is essentially the reliability of the 2nd-level FTPA, i.e., $R_2 = R(n_2, k_2, R_1)$ where $R_1$ is the
reliability of one 1st-level FTPA, i.e., $R_1 = R(n_1, k_1, r_p)$. In order to find the values of $n_1$ and $n_2$ for which $R_2$ is optimal it is
necessary to solve the equation

$$\frac{dR_2}{dn_2} = \frac{\partial R_2}{\partial R_1} \frac{dR_1}{dn_2} + \frac{\partial R_2}{\partial k_2} \frac{dk_2}{dn_2} + \frac{\partial R_2}{\partial n_2} = 0 \qquad (5.1)$$

Once the solution $n_2$ is obtained, $n_1$ can be found because $n$ is assumed to be known. Unfortunately, for even the
simplest bi-level FTPA design, equation (5.1) is very hard to solve. An example illustrating the impossibility of this
approach for a simple case is discussed elsewhere. In order to deal with this problem, a new approach is proposed here which
uses very good functional approximations of $R_1$ and $R_2$ that are explained next.


The type of function chosen to approximate $R_1$ as a function of $n_1$ is the well known Weibull reliability function, i.e.

$$R_1 \approx e^{-\alpha' n_1^{\beta}} \qquad (5.2)$$

where $\alpha'$ and $\beta$ are real positive constants whose value depends on, among other characteristics of the FTPA, the
reconfiguration method used. With this type of approximation it has been possible to approximate with negligible errors
all the reliability estimates discussed in section 3 that have been attempted so far. While it is not easy to show
analytically that $R_1$ (given by equation (2.3)) can always be approximated by a Weibull function, the general characteristics of the
curves obtained for $R_1$ and the experience gained so far indicate that it is highly likely that this is possible in general.
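
The fitting step behind an approximation such as (5.2) can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used by the authors: it assumes that reliability estimates are already available for several array sizes, and it fits the Weibull form by ordinary least squares on the linearized relation ln(-ln R) = ln(alpha') + beta ln(n). The function name `fit_weibull` and the sample data are hypothetical.

```python
import math

def fit_weibull(sizes, reliabilities):
    """Fit R = exp(-alpha * n**beta) to (n, R) pairs with 0 < R < 1.

    Uses least squares on ln(-ln R) = ln(alpha) + beta * ln(n).
    Returns the pair (alpha, beta).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(-math.log(r)) for r in reliabilities]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope of the fitted line
    alpha = math.exp(y_mean - beta * x_mean)  # exp(intercept)
    return alpha, beta

# Hypothetical data generated from known constants, used to check the fit.
true_alpha, true_beta = 1.0e-3, 2.5
sizes = [4, 6, 8, 10, 12]
rels = [math.exp(-true_alpha * n ** true_beta) for n in sizes]
alpha, beta = fit_weibull(sizes, rels)
```

With real reliability estimates the residuals of this linear fit give a quick indication of how well a single Weibull function describes the data.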


The approximation used for $R_2$ as a function of $n_2$ is more involved than that used for $R_1$ as a function of $n_1$. This is
because $R_1$ is approximated as a function of $n_1$ alone (the remaining variables are functions of $n_1$ or fixed) while $R_2$ must
be approximated as a function of $n_2$ and $R_1$ (because, for the 2nd-level FTPA, $r_p = R_1$ is a function of $n_1 = n/n_2$). From
(2.2) and (2.3) it results that $R_2$ can be expressed as

$$R_2 = R(n_2, k_2, R_1) = R_e(n_2, k_2) \, R_{ft}(n_2, k_2, R_1) \qquad (5.3)$$


where 


$$R_e(n_2, k_2) = r_u^{A_e(n_2, k_2)} \qquad (5.4)$$


$$R_{ft}(n_2, k_2, R_1) = \sum_{i} (\cdots) \, R_1^{(\cdots)} (1 - R_1)^{(\cdots)} \qquad (5.5)$$

The form of the approximation function for (5.4) has been chosen to be that of a Weibull reliability function,

$$R_e(n_2, k_2) \approx r_u^{a' n_2^{b}} = e^{(\ln r_u)\, a' n_2^{b}} = e^{-a n_2^{b}} \qquad (5.6)$$

where $a'$ and $b$ are some positive real constants, and $a = -(\ln r_u)\, a'$. From (5.4) and (5.6), it is clear that this amounts to
approximating $A_e(n_2, k_2)$ by $a' n_2^{b}$, i.e., the area of fault-intolerant hardware is assumed to grow proportionally with some
power of the number of processors in the array. Such an approximation may not suffice when $A_e(n_2, k_2)$ varies in a more
complicated way. However, in the worst case, it can always be approximated by a polynomial expression in $n_2$, in which
case $R_e(n_2, k_2)$ would be approximated by a product of Weibull functions. However, experience shows that (5.6) provides
rather good approximations and is assumed hereon.

The approximation used for $R_{ft}(n_2, k_2, R_1)$ as a function of $n_2$ and $R_1$ (the $k_2$ used here is a function of $n_2$) is also in
the form of a Weibull function, i.e.,

$$R_{ft}(n_2, k_2, R_1) = R_{ft}(n_2, R_1) \approx e^{-(-\ln R_1)^{u} v\, n_2^{w}} \qquad (5.7)$$

where $u$, $v$ and $w$ are positive real constants. In order to justify (5.7), consider the expression of $R_{ft}(n_2, k_2, R_1)$ as a
function of time for fixed $n_2$, $k_2$ and $R_1$, i.e., $R_{ft}(t) \approx e^{-\lambda t^{\delta}}$ for some positive real constants $\lambda$, $\delta$. This is a commonly adopted
Weibull approximation to the reliability of a system as a function of time. Similarly, consider the expression of $R_1$ as a
function of time as $R_1(t) = e^{-\gamma t^{\psi}}$ for some positive real constants $\gamma$ and $\psi$. The variable $t$ can be expressed as a function
of $R_1$ and replaced in the expression of $R_{ft}(t)$ to obtain $R_{ft}(n_2, k_2, R_1)$ as a function of $R_1$ as

$$R_{ft}(R_1) \approx e^{-\lambda \left( -\ln R_1 / \gamma \right)^{\delta/\psi}} \qquad (5.8)$$

and, since $R_{ft}(n_2, k_2, R_1)$ depends on $n_2$ according to another Weibull function (just as $R_1$ depends on $n_1$), it results that

$$R_{ft}(n_2, R_1) \approx \left( R_{ft}(R_1) \right)^{x\, n_2^{y}} \qquad (5.9)$$

for some positive real constants $x$ and $y$. Substituting the expression (5.8) for $R_{ft}(R_1)$ in (5.9) and letting $u = \delta/\psi$,
$v = \lambda x \gamma^{-\delta/\psi}$ and $w = y$ yields (5.7).
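
The substitution that leads from (5.8) and (5.9) to (5.7) can be checked numerically. The sketch below is only a sanity check under assumptions: the symbol names (`lam`, `delta`, `gamma`, `psi`, `x`, `y`) follow our reading of the partly illegible constants in the text, and all numeric values are hypothetical.

```python
import math

def r1_of_t(t, gamma, psi):
    """Weibull model of the 1st-level reliability as a function of time."""
    return math.exp(-gamma * t ** psi)

def rft_of_t(t, lam, delta):
    """Weibull model of R_ft as a function of time (fixed n2, k2)."""
    return math.exp(-lam * t ** delta)

def rft_closed_form(n2, r1, u, v, w):
    """Form (5.7): exp(-(-ln R1)**u * v * n2**w)."""
    return math.exp(-((-math.log(r1)) ** u) * v * n2 ** w)

# Hypothetical constants, chosen only to exercise the algebra.
lam, delta = 0.02, 1.5
gamma, psi = 0.01, 1.2
x, y = 0.5, 1.8
t, n2 = 3.0, 5

r1 = r1_of_t(t, gamma, psi)

# Route 1: evaluate R_ft at time t, then apply the n2 dependence of (5.9).
direct = rft_of_t(t, lam, delta) ** (x * n2 ** y)

# Route 2: eliminate t and use (5.7) with u, v, w as defined in the text.
u = delta / psi
v = lam * x * gamma ** (-delta / psi)
w = y
closed = rft_closed_form(n2, r1, u, v, w)
```

Both routes give the same value up to rounding, which is what the definitions of u, v and w are meant to guarantee.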


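In summary, the approximation used for (5.3) is

$$R_2 \approx e^{-a n_2^{b}} \, e^{-(-\ln R_1)^{u} v\, n_2^{w}} \qquad (5.10)$$

Since $n_1 = n/n_2$, and letting $\alpha = \alpha' n^{\beta}$, (5.2) can be rewritten as

$$R_1 \approx e^{-\alpha n_2^{-\beta}} \qquad (5.11)$$

Replacing $R_1$ in (5.10) according to (5.11) yields

$$R_2 \approx e^{-\left( a n_2^{b} + (\alpha n_2^{-\beta})^{u} v\, n_2^{w} \right)} = e^{-\left( a n_2^{b} + \alpha^{u} v\, n_2^{w - \beta u} \right)} \qquad (5.12)$$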

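Equation (5.1) can now be solved, i.e.,

$$\frac{dR_2}{dn_2} = -R_2 \left[ a b\, n_2^{b-1} + \alpha^{u} v (w - \beta u)\, n_2^{w - \beta u - 1} \right] = 0 \qquad (5.13)$$

and because $n_2 \neq 0$ and $R_2 \neq 0$, $dR_2/dn_2 = 0$ if

$$a b\, n_2^{b} + \alpha^{u} v (w - \beta u)\, n_2^{w - \beta u} = 0 \qquad (5.14)$$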

A sufficient and necessary condition for the existence of a real positive solution to (5.14) is that

$$w - \beta u < 0 \qquad (5.15)$$

If there exists no real positive solution to (5.14) then there is no bi-level FTPA (using some fixed fault-tolerant schemes in
each level) with better reliability than a single-level scheme using one of the fault-tolerant schemes used in the 1st and
2nd-level arrays. Equation (5.14) can be rewritten as

$$a b\, n_2^{b - (w - \beta u)} + \alpha^{u} v (w - \beta u) = 0 \qquad (5.16)$$

and letting $\theta = w - \beta u$ and $\phi = -\alpha^{u} v\, \theta / (a b)$, the solution is

$$n_2 = \phi^{1/(b - \theta)} \qquad (5.17)$$

Substituting (5.17) in (5.12) yields the expression for the maximum reliability attainable with a bi-level FTPA using two
given fault-tolerance schemes as

$$R_2 = e^{-\left( a\, \phi^{b/(b - \theta)} + \alpha^{u} v\, \phi^{\theta/(b - \theta)} \right)} \qquad (5.18)$$
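
The closed-form optimum of (5.14) to (5.17) can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function names and the first set of constants are hypothetical. The last line plugs in the case-study values as we read them from the text (theta about -11.24, b about 1.26, phi about 1.84e9); several digits are hard to read in the source, so they should be treated as assumptions.

```python
import math

def optimal_n2(a, b, alpha, beta, u, v, w):
    """Positive real solution of (5.14), or None when condition (5.15) fails."""
    theta = w - beta * u
    if theta >= 0:
        return None  # no bi-level optimum exists for these schemes
    phi = -(alpha ** u) * v * theta / (a * b)
    return phi ** (1.0 / (b - theta))

def r2(n2, a, b, alpha, beta, u, v, w):
    """Approximate bi-level reliability, form (5.12)."""
    return math.exp(-(a * n2 ** b + alpha ** u * v * n2 ** (w - beta * u)))

# Hypothetical constants, chosen so that w - beta*u < 0.
params = dict(a=0.01, b=1.3, alpha=50.0, beta=2.0, u=1.5, v=0.002, w=1.0)
n2_star = optimal_n2(**params)
best = r2(n2_star, **params)

# Case-study check of (5.17) with the values as read from the text.
n2_case = (1.84e9) ** (1.0 / (1.26 - (-11.24)))
```

Evaluating `r2` slightly below and above `n2_star` gives smaller values, which confirms that the stationary point is a maximum for these constants; `n2_case` comes out close to the value 5.5 quoted in the case study.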

In order to verify the potential gains in reliability of bi-level FTPA's several case studies were undertaken. As an
example, one of these studies looked at the problem of designing a logical (36×36) processor array (i.e., the number
(36×36) does not include spares). It was decided that the simple CR approach would be used for 1st-level FTPA's while
the DI scheme would be used for the 2nd-level FTPA in order to maximize utilization of spare 1st-level FTPA's. For both
CR and DI schemes a column of spare processors is used. The area and reliability estimates were computed for both the
CR and DI methods and the reliabilities for each of the levels were approximated as
$$R_2 \approx e^{(\cdots)} \qquad (5.20)$$

The following parameter values were used: $r_p = r_u = 0.99$, a, = a^ = 1/800 and a; = l/dOO. The values of the variables
in (5.17) are $\theta \approx -11.24$, $b \approx 1.26$, $\phi \approx 1.84 \times 10^{9}$ and the optimal value of $n_2$ is $n_2 = 5.5 \approx 5$. This indicates that
$n_1 = 36/5 = 7.2 \approx 7$. Since the targeted array has a size of (36×36), i.e. $n = 36$, and since $n = n_1 n_2 = 35$, one can
decide to build a slightly smaller array using the extra processors as spares or change to $n_1 = 6$ and $n_2 = 6$. When $n_1 = 7$
and $n_2 = 5$ the actual physical array (i.e., including spares) contains 1680 processors, i.e., a redundancy factor of 1.37.
When $n_1 = 6$ and $n_2 = 6$ there are 1764 processors, i.e., a redundancy factor of 1.36. The reliability for the last case is
0.75, whereas for a single-level array using the DI approach or the CR approach with the same redundancy ratio the
reliabilities are 0.08 and 0.31 respectively.
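
The processor counts and redundancy factors quoted above can be reproduced with a short calculation. The sketch assumes that "a column of spare processors" means k1 = n1 spares in each 1st-level FTPA and k2 = n2 spare 1st-level FTPA's in the 2nd-level FTPA; this reading is ours, chosen because it matches the totals in the text. The function names are hypothetical.

```python
def physical_processors(n1, k1, n2, k2):
    """Processors in a bi-level array: (n2*n2 + k2) subarrays,
    each holding (n1*n1 + k1) processors."""
    return (n2 * n2 + k2) * (n1 * n1 + k1)

def redundancy_factor(n1, k1, n2, k2):
    """Physical processors divided by logical processors (n1*n2)**2."""
    logical = (n1 * n2) ** 2
    return physical_processors(n1, k1, n2, k2) / logical

# One spare column per level (assumed): k1 = n1 and k2 = n2.
for n1, n2 in [(7, 5), (6, 6)]:
    total = physical_processors(n1, n1, n2, n2)
    factor = round(redundancy_factor(n1, n1, n2, n2), 2)
    print(n1, n2, total, factor)
```

For (n1, n2) = (7, 5) the logical array is 35×35, so the factor is 1680/1225; for (6, 6) it is 1764/1296.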

6. CONCLUSIONS

Several important related conclusions can be made from the work reported in this paper. First, it has been shown
that non-redundant hardware and extra logic added for fault-tolerant purposes do limit the usefulness of single-level
FTPA's above a certain size. A second conclusion is that, based on reliability estimates for different types of FTPA's for
different array sizes and different area and technology parameters, there is not a single type of FTPA which is universally
optimal. In other words, FTPA's based on different fault-tolerance methods are optimal for different array sizes for a
given technology. A third conclusion is that multi-level FTPA's do not suffer from the disadvantage pointed out in the
first conclusion for single-level FTPA's and can take advantage of the fact pointed out in the second conclusion, the net
result being a highly reliable FTPA.

To achieve the third conclusion mentioned above, the problem of designing optimal bi-level FTPA's was addressed
and a methodology for its solution has been described. The key to this methodology is the realization that, by using
accurate functional approximations of the reliability of FTPA's at different levels, the complexity of the exact analytical
expressions is avoided. These approximations are based on Weibull reliability functions.

The work reported in this paper represents significant progress towards a theory and solutions for FTPA design.
However, it also opens a very large number of questions which relate to how better area and reliability estimates can be
obtained, variations of the optimization criteria (e.g. compound measures of performance, area and reliability) subject to
other constraints (e.g. fixed redundancy ratio), extension to multi-level FTPA's, implementation issues, development of
tools, etc. It is our belief that the framework presented in this paper provides the basis and some ingredients for a sound
theory that might lead to the solution of these new problems. Work in this direction is now in progress.
Figure 1. The reliability $R(40,k)$ of an array using the CR-method is the product of $R_e(40,k)$ and
$R_{ft}(40,k)$; it decreases with k for k/40 > 28. (Horizontal axis: Spare Columns (k/40).)

Figure 2. The reliability of different single-level FTPA schemes with (n×n)+k processors; different
approaches are optimal for different values of n and low reliability results for large arrays.
Extended Abstract

This  paper  surveys  current  bit  level  processor  array  architectures  and  describes  a 
tool  for  designing  and  programming  these  arrays.  The  survey  emphasizes  arrays 
that  have  been  implemented  rather  than  proposed  architectures.  The  essential 
features  shared  by  these  arrays,  and  those  that  differentiate  them  are  characterized 
and  used  to  develop  a  taxonomy  for  bit  level  processor  arrays.  The  second  part  of 
the  paper  discusses  programming  tools,  with  an  emphasis  on  RAB,  a  large  program 
used  to  map  a  class  of  algorithms  written  in  'C'  onto  bit  level  processor  arrays.  The 
basic  components  and  extensions  to  RAB  are  discussed,  along  with  examples  which 
include  the  mapping  of  arithmetic,  numeric  and  neural  network  algorithms  onto 
processor  arrays.  Directions  for  future  research  and  design  of  bit-level  processor 
array  architectures  and  their  programming  environments  are  also  discussed. 
I. Introduction

The  rapid  pace  of  innovation  in  VLSI  has  provided  an  implementation  medium  for  highly 
parallel computer architectures. Fully realizing the tremendous computing power of VLSI
technology requires that its characteristics, often subtle and complex, be understood. First, the
design complexity introduced by the increasing chip size and density calls for architectures
composed of repetitive, modular structures. Second, interconnections between devices consume
more power and require more area than the devices (transistors) themselves. Third, the
computing environment offered by VLSI is I/O-bound, not compute-bound, due to the limited
number of I/O pins available in comparison to logic gates. These considerations have led to
the development of highly parallel bit level processor arrays, which are distinguished by their
communication strategy - digital signals are transmitted bit sequentially on single wires as
opposed to simultaneous transmission on parallel busses - and their large number of simple
1-bit word processing elements (PEs). This leads to efficient communication both internally and
between  chips  and  provides  a  high  degree  of  parallelism.  The  regular,  repeatable  structures 
inherent  in  bit  level  arrays  have  the  following  advantages  over  other  structures  implemented 
using  VLSI: 

(a)  design  complexity  is  reduced,  an  important  feature  as  VLSI  densities  reach  millions 
of  transistors  per  chip; 

(b)  several techniques for introducing fault-tolerance into such structures are available,
and  this  aspect  is  particularly  important  in  wafer  scale  integration  implementations; 

(c)  functional  verification  of  the  design  is  simpler; 

(d)  high  packing  densities  are  possible  as  the  chip  designer  can  concentrate  on 
optimizing a single cell which is then repeated, as is done in memory circuits;

(e)  the  arrays  are  scalable  to  higher  VLSI  densities  and  can  be  pipelined  to  the  bit  level 
to  provide  very  high  throughput  rates; 

(f)  measures  of  active  resources,  i.e.,  the  percentage  of  logic  gates  and  memory  involved 
in  a  computation,  favor  large  bit  level  arrays  and  indicate  that  they  have  the  potential 
for  very  large  throughput. 

The  advantages  from  an  algorithmic  standpoint,  resulting  from  the  flexibility  at  the  bit  level, 
include: 

(a)  using symmetries and optimizations that are possible at the bit level to reduce
computation  time,  e.g.,  squaring  can  take  one-half  the  time  necessary  for 
multiplication; 

(b)  the  precision  used  in  the  array  can  be  matched  to  that  necessary  for  a  particular 
computation; 

(c)  some  of  the  more  flexible  processor  arrays  can  change  the  level  of  parallelism 
available  to  match  the  parallelism  contained  in  an  algorithm. 
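
Advantage (b) above can be illustrated with a bit-serial adder whose precision is a parameter. This is a generic sketch of a 1-bit processing element applied one bit per step, least significant bit first; it is not taken from any of the arrays surveyed here, and the function names are hypothetical.

```python
def full_adder(a, b, carry):
    """1-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ carry
    carry_out = (a & b) | (carry & (a ^ b))
    return s, carry_out

def bit_serial_add(x, y, precision):
    """Add two unsigned integers one bit per step, LSB first.

    Only `precision` bits are processed, so the number of steps
    matches the precision the computation actually needs.
    Returns (result truncated to `precision` bits, final carry).
    """
    carry = 0
    result = 0
    for i in range(precision):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result, carry
```

For example, `bit_serial_add(13, 7, 8)` takes eight steps, while the same operands fit in five bits and could be added in five steps with `precision=5`.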
II. Current Architectures

Bit level processor arrays are characterized by a set of four features:

•  the  basic  building  blocks  are  simple,  modular,  and  repeatable  processing  elements 
that  can  be  placed  in  large  numbers  on  a  single  VLSI  chip 

•  computation  and  communication  are  performed  at  the  bit  level 

•  SIMD (single instruction multiple data) control, with some variations, is used to reduce
the  complexity  of  control  for  these  very  large  arrays  and  amortize  the  expensive 
control  hardware  over  many  PEs 

•  the  dominating  feature  of  these  arrays  is  potentially  massive  concurrency  realized 
through  VLSI  architectures. 

Within  the  context  of  these  features,  we  will  discuss  five  classes  of  bit  level  arrays  and 
architectures  representative  of  each  class.  These  five  classes,  shown  in  Table  I  with 
representative  systems,  are: 

1)  systolic  arrays; 

2)  image  processing  arrays; 

3)  reconfigurable,  or  adaptive,  arrays; 

4)  associative  processing  arrays; 

5)  high level network arrays.

Bit level systolic arrays have been developed to perform several digital signal processing
tasks, including convolution and correlation, rank-order filtering, and the Discrete Fourier
Transform (DFT) [1]. Several of these chips are now sold as commercial products. These
arrays are specialized to perform one particular algorithm, and each processing element is
optimized for the particular algorithm being implemented. This fact, along with the systolic
concepts of extensive pipelining and local communication applied down to the bit level, yields
extremely fast clock speeds, as the clock cycle time is reduced to that of the bit level
processing element.

Image processing arrays include the GAPP chip [2], Goodyear Aerospace MPP [3],
GEC GRID chip [4], and the University College London CLIP [5]. These bit level arrays are
oriented towards image processing as their primary application, and have special image
manipulation features in hardware such as data reformatting buffers, bit plane I/O, and
processing elements optimized for certain image transformations. PAPIA [6] is a processor
array using a pyramid architecture to realize multiresolution pipelined image processing; each
PE in PAPIA operates at the bit level, and has connections to 4 PEs (sons) on a lower plane
and a single PE (father) in a higher plane.
Reconfigurable arrays have the ability to adapt to the different degrees of concurrency
available  for  different  algorithms  and  within  the  same  algorithm.  Bit  level  processor  arrays 
representative  of  this  class  include  the  ICL  DAP  [7],  NTT’s  AAP  chip  [8],  and  the  University 
of  Southampton  RPA  [9].  In  the  latter  two  arrays,  part  of  the  microcode  word  is  held  locally 
within  each  PE,  allowing  some  degree  of  independence  in  both  computation  and  in 
communication  with  other  PEs. 

Associative processing arrays include the Brunel University SCAPE [10] and the Airborne
Associative Processor (ASPRO) [3], a VLSI version of the STARAN associative processor [11].
The salient feature of these processors is a content-addressable memory that is used to
perform bit level operations in parallel. These arrays are particularly adept at fast and
efficient searching.
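
The kind of search a content-addressable memory supports can be sketched as a masked, bit-serial comparison. This is a generic illustration and not a description of how SCAPE or ASPRO operate internally: the function name is hypothetical, and the inner loop over words stands in for what the hardware would do on all words simultaneously.

```python
def associative_search(words, key, mask, width):
    """Return the indices of the words that match `key` on the bit
    positions selected by `mask`.

    Bits are examined one position at a time; for each position every
    word is compared against the corresponding key bit.
    """
    responders = [True] * len(words)
    for bit in range(width):
        if not (mask >> bit) & 1:
            continue  # masked-out position, ignored by the search
        key_bit = (key >> bit) & 1
        # In hardware this comparison happens in all words in parallel.
        for i, word in enumerate(words):
            if ((word >> bit) & 1) != key_bit:
                responders[i] = False
    return [i for i, hit in enumerate(responders) if hit]

memory = [0b1010, 0b1110, 0b0010, 0b1011]
exact = associative_search(memory, 0b1010, 0b1111, 4)
partial = associative_search(memory, 0b1010, 0b1011, 4)  # bit 2 ignored
```

The search time depends on the number of unmasked bit positions, not on the number of stored words, which is why such arrays are well suited to searching.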

High  level  network  arrays  have  the  same  basic  architecture  as  other  bit  level  processor 
arrays,  but  in  addition  to  a  nearest  neighbor  network  these  arrays  will  have,  for  example,  a 
hypercube,  cube  connected  cycles,  or  multistage  cube  interconnection  network.  These  high 
level  networks  provide  high  bandwidth  communication  paths  between  non-neighbor  processing 
elements,  a  requirement  for  many  algorithms.  Examples  of  this  class  include  the  Connection 
Machine  [12]  (hypercube),  the  DEC  Massively  Parallel  Architecture  [13]  (multistage  cube),  and 
the  Boolean  Vector  Machine  [14]  (cube  connected  cycles). 
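
As a small illustration of one of the networks named above, the neighbors of a node in a binary hypercube are the nodes whose addresses differ from it in exactly one bit. The sketch below uses only this standard definition; it does not model the routing hardware of any of the cited machines, and the function names are hypothetical.

```python
def hypercube_neighbors(node, dimensions):
    """Addresses of the nodes directly connected to `node` in a
    binary hypercube with 2**dimensions nodes."""
    return [node ^ (1 << d) for d in range(dimensions)]

def hypercube_distance(src, dst):
    """Minimum number of links between two nodes: the number of
    address bits in which they differ."""
    return bin(src ^ dst).count("1")
```

In a hypercube with 2**d processing elements every PE has d links and any two PEs are at most d links apart, which is the source of the high bandwidth between non-neighbor PEs mentioned in the text.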

Future architectures could combine the features of reconfigurable arrays with high level
networks  to  provide  arrays  with  varying  degrees  of  parallelism  and  flexible  communication. 
This  approach  holds  the  promise  of  efficiently  mapping  a  broader  class  of  algorithms  onto 
these  arrays. 

A  more  comprehensive  survey  and  elaborate  taxonomy  will  appear  in  the  final  paper. 


Table I. Bit Level Processor Arrays

Systolic            Image Processing   Reconfigurable   Associative Processing   High Level Network
------------------  -----------------  ---------------  -----------------------  ----------------------
DFT                 GAPP               AAP              SCAPE                    DEC MPA
Rank-Order Filter   MPP                RPA              ASPRO                    Connection Machine
Convolution         GRID               DAP                                       Boolean Vector Machine
                    CLIP
                    PAPIA


III. Programming and Design Tools

Despite the advantages of bit level processor arrays in both VLSI implementation and
algorithm  execution,  they  can  be  difficult  to  program  (in  the  case  of  an  existing  general 
purpose  architecture)  or  design  (in  the  case  of  a  special  purpose  architecture).  This  problem  is 
accentuated  by  the  need  to  implement  high  level  computations  (e.g.,  matrix  computations, 
convolution)  using  bitwise  operations.  For  the  architectures  described  previously,  several 
approaches  to  this  problem  have  been  used.  These  include: 

•  subroutine  libraries  accessed  through  a  standard  high-level  language  which  have  been 
optimized  for  a  particular  architecture 

•  parallel  languages,  which  are  often  architecture  or  machine  dependent 

•  microcoded  routines  to  handle  standard  word  level  operations  in  a  general  way, 
without  using  bit  level  symmetries  or  optimizations. 

These  approaches  lack  portability  among  different  machines  and  sometimes  ignore 
optimizations  possible  at  the  bit  level.  It  is  often  difficult  to  prove  the  optimality  of  a  given 
mapping  using  these  methods.  In  order  to  solve  the  problems  associated  with  current 
approaches,  it  is  desirable  to  develop  methodologies  and  tools  which  enable  the  systematic 
mapping  of  algorithms  onto  processor  arrays.  In  the  past,  several  research  efforts  have  been 
pursued  in  this  direction  and  a  good  survey  can  be  found  in  [15].  Many  of  these  methodologies, 
which  were  intended  for  word  level  processor  arrays,  are  applicable  to  bit  level  arrays. 
However,  besides  some  of  the  limitations  that  still  characterize  those  methodologies, 
systematic  bit  level  designs  present  additional  problems.  RAB  (Reconfiguration  Algorithm  for 
Bit  level  code),  an  automated  design  tool  which  maps  a  class  of  algorithms  programmed  in  ‘C’ 
into  bit  level  arrays,  represents  an  attempt  to  understand  and  solve  the  open  questions  and 
problems  involved  in  the  systematic  design  of  bit  level  processor  arrays. 

In  practice,  potential  users  of  processor  arrays  are  given  an  algorithm  and  must  devise 
means  for  its  execution  using  one  of  the  following  options:  (1)  use  an  existing  processor  array, 
(2)  design  a  special  purpose  processor  array,  or  (3)  design  an  array  that  uses  a  number  of 
existing  small  processor  array  modules  as  the  basic  components.  Option  (1)  requires  mapping 
the  algorithm  into  an  existing  array  taking  into  consideration  size  limitations,  fixed 
interconnection  schemes,  and  predesigned  processing  elements.  In  this  option,  which  we  refer 
to  as  full  mapping,  the  programming  decisions  are  subordinated  to  the  characteristics  of  the 
array.  Option  (2)  allows  the  user  to  design  the  hardware  taking  into  consideration  only  the 
characteristics  of  the  algorithm  and  perhaps  some  rather  general  VLSI  design  constraints  (i.e., 
planarity,  limited  pinout,  etc).  This  option  is  referred  to  as  full  design.  It  corresponds  to  the 
front  end  of  a  silicon  compiler,  and  could  provide  the  structural  and  behavioral  description  of 
an  array  to  the  placement  and  layout  tools  of  the  silicon  compiler.  Option  (3)  is  a  compromise 
between  full  mapping  and  full  design,  where  the  designer  can  decide  the  overall  organization 
(i.e.,  shape,  size,  interfaces)  of  the  array,  but  uses  given  basic  blocks  which  are  themselves  fully 
defined  "small"  processor  arrays.  We  refer  to  this  option  as  partial  mapping/design. 


The input to RAB consists of C programs which describe word level algorithms. These
algorithms correspond to nested for loops with static behavior. RAB first expands the
computations in the input program into bit level operations as shown in Figure 1. This
expansion phase replaces word level computations with a bit level implementation of the
arithmetic operations using a library of macro expansions. This phase is followed by data
dependence and broadcast analysis using the Dependence Arc Set Analysis technique [16]. The
result of this analysis is a formal description of the internal structure of the bit level
algorithm. This structural information is used to generate an algorithm transformation which
yields a restructured algorithm suitable for mapping onto a bit level processor array. The
mapping may be a full design of an algorithmically defined array or full (partial) mapping for
a fixed (variable) size array corresponding to the fourth level of modules in Figure 1. The
transformed algorithm structure is then converted into an intermediate representation which
can be used to generate code for several different architectures. The last two modules in Figure
1, code generation and code optimization, comprise the phase in which code is generated from
the intermediate representation for a particular target architecture. This code is then
optimized using a standard compaction technique. RAB has been applied to numerical,
arithmetic and neural network algorithms.
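
To give a feel for what the expansion phase produces, the sketch below rewrites one word level operation, an unsigned multiplication, as a doubly nested loop over bit positions that uses only 1-bit AND and full-adder steps. It is a hypothetical illustration written in Python for brevity; it is not RAB's macro library, its input language (which is C) or its output format.

```python
def bit_level_multiply(x, y, width):
    """Unsigned multiply of two `width`-bit words using only
    bit level operations, organized as nested loops over bits."""
    acc = [0] * (2 * width)        # product bits, LSB first
    for i in range(width):         # bit i of the multiplier y
        y_bit = (y >> i) & 1
        carry = 0
        for j in range(width):     # bit j of the multiplicand x
            partial = ((x >> j) & 1) & y_bit       # 1-bit AND
            total = acc[i + j] + partial + carry   # full adder on three bits
            acc[i + j] = total & 1                 # sum bit
            carry = total >> 1                     # carry bit
        acc[i + width] = carry     # this position is still zero before row i
    return sum(bit << k for k, bit in enumerate(acc))
```

Each iteration of the inner loop corresponds to one bit level operation with indices (i, j); this index structure is the kind of information a dependence analysis would work from.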

IV.  Summary 

This  paper  describes  bit  level  processor  arrays  and  the  characteristics  that  make  them 
ideal  for  VLSI  and  high  speed  computations.  It  presents  a  taxonomy  for  current  bit  level 
arrays  that  provides  different  perspectives  on  this  class  of  architectures.  The  design  and 
programming  problem  is  addressed,  and  an  automated  design  tool,  RAB,  is  presented  as  a 
solution  to  some  of  these  problems.  RAB  has  the  twin  advantages  of  portability  among 
different architectures and systematic optimization techniques down to the bit level.